CASE STUDY - Operational ETL Provides Solutions for Past Data Quality Woes
By Chris Jennings, Senior Principal, Collaborative Consulting
An educational publishing company (EPC) wanted to replace its core systems, including finance, order management, e-commerce, and content management. Early in the initiative, it was determined that a key element for success would be a data quality program instituted as part of the data conversion and integration effort. Experience and research into the common causes of failure for large ERP implementations highlighted this as a substantial risk area.
A consistent theme echoed across the failed implementations was an inability to properly cleanse, consolidate, and restructure data as part of the data conversion and ongoing data integration processes. This finding was not surprising, as it is the point of failure for many business intelligence initiatives. These difficulties commonly result from one or both of two problems: an ill-suited or overly complicated technology solution, or a lack of focus on properly understanding the data. This case study focuses on the technology solution.
EPC chose to standardize on an ETL tool (Informatica) as the backbone of its data integration architecture. While many practitioners associate ETL tools with batch data warehouse architectures, ETL tools are increasingly used for operational systems integration. The base transformation functionality in the leading ETL tools has matured, so major changes in transformation functionality from release to release are no longer common. This has allowed vendors to invest in other areas, such as SOA enablement and stronger integration with data quality technologies.
These changes have made the ETL tools more attractive for broad use in handling the majority of data integration tasks across the enterprise. EPC is utilizing the base Informatica ETL tool PowerCenter with the real-time and Web services options, Informatica Data Quality, and PowerExchange for Oracle E-Business Suite.
The ETL backbone envisioned by EPC for the first release of the ERP project satisfies the required architecture functions. It quickly became evident that the functions required in an operational environment are not significantly different from those in a data warehouse environment. Here are some examples of the functions:
- Flag, track, and fix errors
- Cleanse name and address data
- Schedule jobs
- Read and write from heterogeneous sources and targets
- Translate source system codes into standardized codes (see the sketch after this list)
- Handle change data capture
- Transform and restructure data
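The code-translation function is easy to picture outside of any particular tool. Below is a minimal Python sketch, assuming a hypothetical crosswalk table (CODE_CROSSWALK) and hypothetical source system names; in an implementation like EPC's, this logic would live in the ETL tool's transformation layer rather than in application code.

```python
# Illustrative sketch (not Informatica-specific): translate source system codes
# into standardized enterprise codes via a crosswalk table.

# Hypothetical crosswalk: (source system, source code) -> standard code
CODE_CROSSWALK = {
    ("legacy_orders", "A"): "ACTIVE",
    ("legacy_orders", "I"): "INACTIVE",
    ("ecommerce", "act"): "ACTIVE",
    ("ecommerce", "susp"): "SUSPENDED",
}

def standardize_code(source_system: str, source_code: str) -> str:
    """Return the standardized code, or raise so the record is flagged as an error."""
    try:
        return CODE_CROSSWALK[(source_system, source_code)]
    except KeyError:
        # Unmapped codes are surfaced as errors rather than silently passed through.
        raise ValueError(f"No standard code for {source_system!r}/{source_code!r}")
```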
While the necessary functions are the same, the implementations are different from those in a data warehouse. Here are a few examples of these differences:
Error records in a data warehouse are flagged and tracked but are seldom handled or fixed, because they rarely have a significant impact: analysis is done in aggregate, so a single missing record is often statistically inconsequential. In an operational system, this is not true. For example, if an incoming order record fails because a product cannot be found, there is a customer waiting for that product to be shipped. The error must be corrected in a timely fashion.
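A minimal sketch of this difference, independent of any specific ETL tool: failed order records are flagged and routed to an error queue rather than dropped, so they can be corrected and reloaded promptly. The names here (OrderRecord, PRODUCT_MASTER, error_queue) are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical product master used for the lookup.
PRODUCT_MASTER = {"BOOK-1001": {"title": "Algebra I"}, "BOOK-1002": {"title": "Biology"}}

@dataclass
class OrderRecord:
    order_id: str
    product_sku: str
    error: Optional[str] = None

# Failed records are tracked here for timely correction and reprocessing.
error_queue: list[OrderRecord] = []

def load_order(record: OrderRecord) -> None:
    """Load one incoming order; route failures to the error queue instead of dropping them."""
    if record.product_sku not in PRODUCT_MASTER:
        # Flag and track the error, but do not discard it: a customer is waiting
        # for the product, so the record must be corrected and reloaded promptly.
        record.error = f"Product {record.product_sku} not found"
        error_queue.append(record)
        return
    # ... insert the order into the target operational tables ...
```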
Addresses in a data warehouse often are cleansed in batch. At EPC, the operational system needs them cleansed in real time. The SOA enablement of the ETL technology has allowed EPC to utilize the ETL backbone for this function.
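A rough sketch of what the real-time pattern looks like from the calling application's side, assuming a hypothetical address-cleansing endpoint exposed by the SOA-enabled ETL layer; the URL and payload shape are illustrative assumptions, not EPC's actual interface.

```python
import json
import urllib.request

CLEANSE_URL = "https://example.internal/address-cleanse"  # hypothetical endpoint

def cleanse_address(address: dict) -> dict:
    """Call the cleansing service synchronously and return the standardized address."""
    request = urllib.request.Request(
        CLEANSE_URL,
        data=json.dumps(address).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # A short timeout matters here: the caller is an operational transaction,
    # not an overnight batch job.
    with urllib.request.urlopen(request, timeout=2) as response:
        return json.load(response)
```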
In data warehousing environments, the ETL tool or an enterprise scheduling package executes large batches of jobs with extensive dependencies, and these batches are often run overnight. For EPC, we are using the ETL tool to integrate data from multiple applications in smaller batches that run frequently throughout the day.
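A simplified sketch of the intraday micro-batch pattern, with run_integration_batch standing in for whatever mechanism actually launches the ETL workflow (command line, web service, or scheduler); the 15-minute cadence is an assumption for illustration only.

```python
import time
from datetime import datetime

BATCH_INTERVAL_SECONDS = 15 * 60  # small batches every 15 minutes (assumed cadence)

def run_integration_batch(since: datetime) -> None:
    """Hypothetical: extract and load only the changes captured since the last run."""
    print(f"Processing changes since {since.isoformat()}")

def main() -> None:
    # Instead of one large overnight batch, process a small change-data-capture
    # window repeatedly throughout the day.
    last_run = datetime.now()
    while True:
        time.sleep(BATCH_INTERVAL_SECONDS)
        now = datetime.now()
        run_integration_batch(since=last_run)
        last_run = now

if __name__ == "__main__":
    main()
```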