Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise (Report Excerpt)
By Colin White, President, BI Research
This report is a sequel to TDWI’s 2003 report Evaluating ETL and Data Integration Platforms. The objective of the present report is to look at how data integration techniques, technologies, applications, and products have evolved since the 2003 report was published. The focus this time is not only on the role of data integration in data warehousing projects, but also on developing an enterprisewide data integration strategy.
The Challenges of Data Integration
Integrating disparate data has always been a difficult task, and given the data explosion occurring in most organizations, this task is not getting any easier. Over 69% of respondents to our survey rated data integration issues as either a very high or high inhibitor to implementing new applications. The three main data integration issues (see Figure 1) listed by respondents were data quality and security, lack of a business case and inadequate funding, and a poor data integration infrastructure.
Top Data Integration Issues
Figure 1. The top inhibitors to the success of data integration projects. Respondents were asked to select up to three. Based on 672 respondents.
Characteristics of Data Integration
Data integration involves a framework of applications, techniques, technologies, and products for providing a unified and consistent view of enterprise business data (see Figure 2).
- Applications are custom-built and vendor-developed solutions that utilize one or more data integration products.
- Products are off-the-shelf commercial solutions that support one or more data integration technologies.
- Technologies implement one or more data integration techniques.
- Techniques are technology-independent approaches for doing data integration.
Data Integration Framework
Figure 2. Components of a data integration solution.
Following is a review of the techniques and technologies used in data integration projects.
Data Integration Techniques
There are three main techniques used for integrating data: consolidation, federation, and propagation.
Data Consolidation captures data from multiple source systems and integrates it into a single persistent data store. This data store may be used for reporting and analysis as in data warehousing, or it can act as a source of data for downstream applications as in an operational data store.
With data consolidation, there is usually a delay, or latency, between the time updates occur in source systems and the time those updates appear in the target store. Depending on business needs, this latency may be a few seconds, several hours, or many days. The term near real time is often used to describe target data that has a low latency of a few seconds, minutes, or hours. Data with zero latency is known as real-time data, but this is difficult to achieve using data consolidation.
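The consolidation technique and its latency can be illustrated with a minimal sketch. The source rows, field names, and the `consolidate` function below are hypothetical and are not taken from the report; the in-memory dictionary merely stands in for a persistent target store.

```python
from datetime import datetime, timedelta

# Hypothetical source systems: each row records when it was updated at the source.
crm_rows = [
    {"customer_id": 1, "name": "Acme", "updated_at": datetime(2005, 11, 1, 9, 0, 0)},
]
billing_rows = [
    {"customer_id": 1, "balance": 250.0, "updated_at": datetime(2005, 11, 1, 9, 5, 0)},
]

def consolidate(sources, loaded_at):
    """Merge rows from all sources into one store keyed by customer_id."""
    store = {}
    for rows in sources:
        for row in rows:
            target = store.setdefault(row["customer_id"],
                                      {"customer_id": row["customer_id"]})
            for key, value in row.items():
                if key != "updated_at":
                    target[key] = value
            # Latency: gap between the source update and its arrival in the target.
            latency = loaded_at - row["updated_at"]
            target["max_latency"] = max(target.get("max_latency", timedelta(0)),
                                        latency)
    return store

store = consolidate([crm_rows, billing_rows],
                    loaded_at=datetime(2005, 11, 1, 10, 0, 0))
print(store[1])
```

In this sketch the load runs at 10:00, so the oldest source update reaches the target one hour late. Running the load more often is what moves such a store toward near real time.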
Data Federation provides a single virtual view of one or more source data files. When a business application issues a query against this virtual view, a data federation engine retrieves data from the appropriate source data stores, integrates it to match the virtual view and query definition, and sends the results to the requesting business application. By definition, data federation always pulls data from source systems on an on-demand basis. Any required data transformation is done as the data is retrieved from the source data files. Enterprise information integration (EII) is an example of a technology that supports a federated approach to data integration.
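A federated query can be sketched as follows. The two sources, their fields, and the `query_customer_orders` function are illustrative assumptions; a real federation engine would reach separate databases or files through their own interfaces.

```python
# Hypothetical source data stores; nothing is copied into a separate target.
orders_source = [
    {"order_id": 10, "customer_id": 1, "amount_cents": 12500},
    {"order_id": 11, "customer_id": 2, "amount_cents": 4000},
]
customers_source = {1: {"name": "Acme"}, 2: {"name": "Globex"}}

def query_customer_orders(customer_id):
    """Answer a query against the virtual view by pulling from the sources on demand."""
    name = customers_source[customer_id]["name"]
    results = []
    for order in orders_source:
        if order["customer_id"] == customer_id:
            # Transformation happens during retrieval: cents become a decimal amount.
            results.append({
                "customer": name,
                "order_id": order["order_id"],
                "amount": order["amount_cents"] / 100,
            })
    return results

print(query_customer_orders(1))
```

Because the view is computed at query time, an order added to `orders_source` is visible to the very next query, with no load step in between.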
Data Propagation applications copy data from one location to another. These applications usually operate online and push data to the target location; i.e., they are event-driven. Updates to a source system may be propagated asynchronously or synchronously to the target system. Synchronous propagation requires that updates to both source and target systems occur in the same physical transaction. Regardless of the type of synchronization used, propagation guarantees the delivery of the data to the target. This guarantee is a key distinguishing feature of data propagation. Most synchronous data propagation technologies support a two-way exchange of data between a data source and a data target. Enterprise application integration (EAI) and enterprise data replication (EDR) are examples of technologies that support data propagation.
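Synchronous propagation can be sketched with Python's standard `sqlite3` module. As a simplifying assumption, the source and target are two tables in one in-memory database, so a single local transaction covers both; real products typically span separate systems and need a distributed transaction. The table names and the `propagate_sync` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE target_accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.commit()

def propagate_sync(account_id, balance, fail_on_target=False):
    """Write to source and target in one transaction: both commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT OR REPLACE INTO source_accounts VALUES (?, ?)",
                         (account_id, balance))
            if fail_on_target:
                raise RuntimeError("simulated target failure")
            conn.execute("INSERT OR REPLACE INTO target_accounts VALUES (?, ?)",
                         (account_id, balance))
        return True
    except RuntimeError:
        return False

propagate_sync(1, 100.0)                      # update reaches both tables
propagate_sync(2, 50.0, fail_on_target=True)  # rolled back in both tables
```

The second call shows why the shared transaction matters: when the target write fails, the source write is undone as well, so the two stores never disagree.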
A Hybrid Approach. The techniques used by data integration applications will depend on both business and technology requirements. It is quite common for a data integration application to use a hybrid approach that involves several data integration techniques.
Data Integration Technologies
A wide range of technologies are available for implementing the data integration techniques outlined above. This section reviews four of the main ones: extract, transform, and load (ETL); enterprise information integration (EII); enterprise application integration (EAI); and enterprise data replication (EDR). Master data management (MDM) and customer data integration (CDI), which are really data integration applications, are also discussed because they are often thought of as data integration technologies.
Extract, Transform, and Load
As the name implies, ETL technology extracts data from source systems, transforms it to satisfy business requirements, and loads the results into a target destination. Sources and targets are usually databases and files, but they can also be other types of data stores such as a message queue. ETL supports a consolidation approach to data integration.
Data can be extracted in schedule-driven pull mode or event-driven push mode. Both modes can take advantage of changed data capture. Pull mode operation supports data consolidation and is typically done in batch. Push mode operation is done online by propagating data changes to the target data store.
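Changed data capture can be implemented in several ways, and the report excerpt does not prescribe one. The sketch below assumes the simplest approach, comparing two snapshots keyed by primary key; the function names and sample rows are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and return only what changed."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

def apply_changes(target, changes):
    """Bring the target up to date by applying only the captured changes."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)
        else:
            target[key] = row

previous = {1: {"qty": 5}, 2: {"qty": 7}}
current = {1: {"qty": 6}, 3: {"qty": 1}}
target = dict(previous)
apply_changes(target, capture_changes(previous, current))
```

Only three change records move to the target here, rather than the full data set, which is why changed data capture helps in both pull and push modes.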
Data transformation may involve data record restructuring and reconciliation, data content cleansing, and/or data content aggregation. Data loading may cause a complete refresh of a target data store or may be done by updating the target destination. Interfaces used here include de facto standards such as ODBC, JDBC, and JMS, as well as native database and application interfaces.
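The three ETL stages can be sketched as a small batch job using only the Python standard library. The sample data, the cleansing rule (skip rows whose amount cannot be parsed), and the table name are assumptions made for illustration, and the load step shown is the complete-refresh variant.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real job would read from a file or database.
SOURCE_CSV = """region,amount
 east ,100.50
WEST,20
east,abc
West,30.25
"""

def extract(text):
    """Extract: read source records into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse region names, drop bad amounts, aggregate by region."""
    totals = {}
    for row in rows:
        region = row["region"].strip().upper()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed cleansing rule: skip unparseable amounts
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def load(conn, totals):
    """Load: complete refresh of the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_by_region "
                 "(region TEXT PRIMARY KEY, total REAL)")
    conn.execute("DELETE FROM sales_by_region")
    conn.executemany("INSERT INTO sales_by_region VALUES (?, ?)",
                     sorted(totals.items()))
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
```

An incremental load would replace the `DELETE` followed by `INSERT` with updates to existing rows, which corresponds to the second loading option described above.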
In our survey, 57% of respondents rated their batch ETL usage as high (see Figure 3). Adding a medium rating to the result increases the figure to 81%. The survey also asked what the likely usage of batch ETL will be in two years. The result was 58% for high usage, and 82% for high and medium. As expected, these figures demonstrate that the batch ETL market has flattened out because most organizations use it.
ETL Use in Organizations
Figure 3. Batch ETL use is flat, but changed data capture and online ETL use will grow over the next two years. Based on 672 respondents.
The picture changes when looking at the growth figures for changed data capture (CDC) and online ETL operations. Our survey shows 16% of respondents rated their usage of CDC in ETL today as high. This number grows to 36% in two years. The equivalent figures for online ETL (called real-time or trickle-feed ETL in the survey) were 6% and 23%, respectively. These growth trends are due primarily to shrinking batch windows and the increasing need for low-latency data. It is interesting to note that combining the high and medium usage figures for the two-year projection of online ETL gives a result of 55%. This clearly shows the industry is moving from batch to online ETL usage.
Enterprise Information Integration
EII provides a virtual business view of dispersed data. This view can be used for demand-driven query access to operational business transaction data, a data warehouse, and/or unstructured information. EII supports a data federation approach to data integration.
The objective of EII is to enable applications to see dispersed data as though it resided in a single database. EII shields applications from the complexities of retrieving data from multiple locations, where the data may differ in semantics and formats, and may employ different data interfaces.
Distinguishing features to look for when evaluating EII products include the data sources and targets supported (including Web services and unstructured data), transformation capabilities, metadata management, source data update capabilities, authentication and security options, performance, and caching.
In our survey, 5% of respondents rated their EII use as high (see Figure 4). Adding a medium rating to the result increases the figure to 19%. These figures grow to 22% and 52% respectively in two years, indicating considerable interest in exploiting EII technology in the future.
EII Use in Organizations
Figure 4. EII use is low at present but its usage is likely to grow rapidly. Based on 672 respondents.
Enterprise Application Integration
EAI integrates application systems by allowing them to communicate and exchange business transactions, messages, and data with each other using standard interfaces. It enables applications to access data transparently without knowing its location or format. EAI is usually employed for real-time operational business transaction processing. It supports a data propagation approach to data integration.
The direction of the EAI industry is toward the use of an enterprise service bus (ESB) that supports the interconnection of legacy and packaged applications, and also Web services that form part of a service-oriented architecture (SOA).
From a data integration perspective, EAI can be used to transport data between applications and to route real-time event data to other data integration applications like an ETL process. Access to application sources and targets is done via Web services, Microsoft .NET interfaces, Java-related capabilities such as JMS, legacy application interfaces and adapters, etc.
EAI is designed to propagate small amounts of data from one application to another. This propagation can be synchronous or asynchronous, but is nearly always done within the scope of a single business transaction. In the case of asynchronous propagation, the business transaction may be broken down into multiple physical transactions. An example would be a travel request that is broken down into separate but coordinated airline, hotel, and car reservations.
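The travel request example can be sketched as one business transaction carried out through three queued messages, each standing in for a separate physical transaction. The report does not say how the reservations are coordinated; the sketch assumes a simple compensation scheme in which earlier reservations are cancelled if a later one fails. The `book_travel` function and its message format are hypothetical.

```python
from collections import deque

def book_travel(request, failing_service=None):
    """Run one business transaction as three separate, coordinated steps."""
    queue = deque(
        {"service": service, "traveler": request["traveler"]}
        for service in ("airline", "hotel", "car")
    )
    confirmed = []
    while queue:
        message = queue.popleft()
        if message["service"] == failing_service:
            # Assumed coordination rule: undo the steps already committed.
            cancelled = list(reversed(confirmed))
            return {"status": "cancelled", "confirmed": [], "cancelled": cancelled}
        confirmed.append(message["service"])
    return {"status": "booked", "confirmed": confirmed, "cancelled": []}

print(book_travel({"traveler": "Ann"}))
print(book_travel({"traveler": "Ann"}, failing_service="hotel"))
```

Each message is small and self-contained, which reflects the point that EAI moves small amounts of data per interaction rather than bulk data sets.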
Data transformation and metadata capabilities in an EAI system are focused toward simple transaction and message structures, and they cannot usually support the complex data structures handled by ETL products. In this regard, EAI does not compete with ETL.
In our survey, 9% of respondents rated their EAI usage as high (see Figure 5). Adding a medium rating increases the figure to 29%. These figures grow to 26% and 58% respectively in two years. It is important to point out that the question relates to the use of EAI for data integration, as opposed to the use of EAI in the organization overall. The two-year EAI projection of 58% is consistent with the 55% high and medium projection for online ETL use mentioned earlier. This suggests that organizations see the need to merge the event-driven benefits of EAI with the transformation and consolidation power of ETL.
EAI Use in Organizations
Figure 5. EAI growth is consistent with the growth in online ETL use shown in Figure 3. This suggests the two technologies will be used together. Based on 672 respondents.
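The high-usage figures quoted in the technology sections above can be compared side by side. The short calculation below uses only the percentages stated in the text and derives the projected change in percentage points; the variable names are illustrative.

```python
# "High usage" percentages reported in the survey: (today, in two years).
high_usage = {
    "batch ETL": (57, 58),
    "changed data capture": (16, 36),
    "online ETL": (6, 23),
    "EII": (5, 22),
    "EAI": (9, 26),
}

# Projected change, in percentage points, for each technology.
growth_points = {tech: later - today for tech, (today, later) in high_usage.items()}

for tech, points in sorted(growth_points.items(), key=lambda item: -item[1]):
    print(f"{tech}: +{points} percentage points")
```

On these figures, batch ETL is essentially flat at one point of growth, while changed data capture, online ETL, EII, and EAI each gain between 17 and 20 points.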
This study looked at data integration approaches across a wide range of different companies and applications. The study results show that these companies fall into two main groups:
- Large organizations that are moving toward building an enterprisewide data integration architecture. These companies typically have a multitude of data stores and large amounts of legacy data. They focus on buying an integrated product set and are interested in leading-edge data integration technologies. These organizations also buy high-performance best-of-breed products that work in conjunction with mainline data integration products to handle the integration of large amounts of data. They are also more likely to have a data integration competency center.
- Medium-sized companies that focus on data integration solely from a business intelligence viewpoint and evaluate products by how well they will integrate with the organization’s BI tools and applications. These companies often have less legacy data, and are less interested in leading-edge approaches such as right-time data and Web services.
In evaluating and applying the contents of this report, it is important to understand which of the two categories your company fits into, and thus how sophisticated a data integration environment your company needs. Nonetheless, many of the ideas and concepts presented in this report apply equally to all companies, regardless of size. The main message of this report is that data integration problems are becoming a barrier to business success, and your company must have an enterprisewide data integration strategy if it is to overcome this barrier.
Colin White is the founder of BI Research. With over 35 years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles on business intelligence and enterprise business integration, and publishes an expert channel and a blog on the Business Intelligence Network. He can be reached at [email protected].
This article was excerpted from the full November 2005 report. TDWI appreciates the sponsorship of Business Objects, Collaborative Consulting, DataFlux, DataMirror Corporation, IBM Information Integration Solutions, Informatica Corporation, SAP America, Sunopsis, and Syncsort Incorporated.
To download the full report, visit www.tdwi.org/research/reportseries.