Three Steps to Successful Data Integration
This three-step approach will help you establish or reenergize your DI development projects.
- By John O'Brien
- April 10, 2012
[Editor’s Note: John O’Brien will be leading classes on emerging BI technologies, ETL architectures, and mobile BI at the TDWI World Conference in Chicago, May 6-11. Here he shares a three-step approach to establishing or reenergizing your data integration projects.]
A mainstream billion-dollar industry for nearly two decades, data integration (DI) is one of the core competencies and necessities of business intelligence (BI) programs and typically accounts for 60 to 70 percent of a BI budget. However, DI has historically lacked a management discipline of design and principles similar to the more familiar data management or data modeling practices.
This article outlines a universal, three-step DI discipline that provides a simple and sustainable methodology for managing your ETL architecture. A solid DI framework and methodology must be simple, communicative, and scalable, and should work within the constructs of your chosen BI data and technology architectures. Experience has shown that complex frameworks and methodologies, or those that lack a defined discipline, cannot scale and lack the communication elements that guide successful growth.
Following this three-step approach will enable you to establish or reenergize your DI development projects.
- Step One: Define and communicate the DI framework
- Step Two: Apply this framework to the modeling process
- Step Three: Leverage this foundation to continuously and efficiently deliver DI projects as your DI architecture evolves and aligns with business priority and strategy
Step One: Defining and Communicating the Framework
The data integration framework you create embodies your beliefs, principles, and requirements for your company’s BI platform. Although the building block of DI remains extract-transform-load (ETL), it should be expanded into four high-level categories: extracting, transforming, cleansing, and loading.
-- Extraction can involve fetching flat files, acquiring data from shadow tables in a database, snapshotting and differencing files, receiving streaming data through pipes, or subscribing to a message bus. Extraction components can also be subcategorized for reuse by database type.
-- Transformation subcategories (such as conforming for integration or splitting) help resolve multiple source data models of the same data into a single target data model. Derivations, aggregations, decoding, filtering, projections, and summarizations are all typical transformation types that source-to-target logic can be mapped into.
-- Cleansing should be identified separately because different architectures implement cleansing rules in different ways and to different degrees. Its high-level subcategories should include technical data quality checks and business data quality checks. Technical cleansing handles integration issues that arise from data type conversions and data-modeling integrity constraints, both within and across entities and relationships. Business data quality checks are isolated to business rules that may be departmental in perspective or change over time with business processes. Avoid calling the cleansing category “quality”: quality is a large discipline unto itself, spanning data, code, requirements, and design quality, and it is not truly within the scope of data integration, which carries out transformation rules.
-- Loading is the final integration category and can include bulk-loading flat files with a database's SQL loader, inserting records directly into database tables, or publishing on an enterprise message bus. Loading choices can be influenced by your BI architecture principles and requirements. For example, a database independence or portability requirement that mandates flat file loading allows real-time environments to suspend DI during database maintenance activities or interruptions of service.
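The four categories and their subtypes described above can be sketched as a simple component registry. The following is a minimal illustration, not the API of any particular ETL tool; the class names, subtype identifiers, and registry layout are assumptions for the sketch:

```python
# Minimal sketch of the four-category DI framework: each high-level
# category maps to the subcategories the framework enumerates.
from enum import Enum

class DICategory(Enum):
    EXTRACT = "extract"
    TRANSFORM = "transform"
    CLEANSE = "cleanse"
    LOAD = "load"

# Subcategory names drawn from the examples in the text; identifiers
# are illustrative.
FRAMEWORK = {
    DICategory.EXTRACT:   ["flat_file", "shadow_table", "snapshot_diff",
                           "streaming_pipe", "message_bus"],
    DICategory.TRANSFORM: ["conform", "split", "derive", "aggregate",
                           "decode", "filter", "project", "summarize"],
    DICategory.CLEANSE:   ["technical_dq_check", "business_dq_check"],
    DICategory.LOAD:      ["bulk_load", "insert", "publish"],
}

def classify(component: str) -> DICategory:
    """Look up which high-level category a subcategory belongs to."""
    for category, subtypes in FRAMEWORK.items():
        if component in subtypes:
            return category
    raise ValueError(f"unknown component type: {component}")
```

A registry like this gives the framework a single communicable vocabulary: every DI component in a project can be named by its category and subtype.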
Step Two: Applying the Framework to the Modeling Process
DI modeling is a powerful, straightforward, and familiar methodology that allows for more efficient and repeatable project management, improved communication, and higher quality standardized designs. Similar to data modeling, DI modeling follows the process of conceptual modeling, logical modeling, and physical modeling found in the Zachman Framework.
-- A conceptual model lays out the high-level data flows and integration across operational systems and the BI system’s architecture or data architecture. This diagram shows data movements across physical data stores to the final target data store of a project. It identifies and isolates data integrations, such as unique data extraction technologies or techniques, and divides a BI project into work streams for several developers.
-- A logical model relies heavily on the DI framework’s categorical listing of data integration component types and subtypes. Logical DI modeling componentizes the data integration process and matches each integration with the BI program’s adopted data and technology architectures.
-- A DI physical model becomes the design specification for the ETL developer. Any future developer needing to make modifications can quickly become familiar with the code by reviewing its associated logical data integration model. Understanding the logic for debugging purposes is also made easier.
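The progression from logical model to physical design specification can be sketched as a small data structure. The dataclass fields, model name, and example components below are hypothetical, chosen only to show how a logical DI model built from framework components can be rendered into the specification an ETL developer works from:

```python
# Hedged sketch: a logical DI model as an ordered list of framework
# components with a source and target, rendered to a design spec.
from dataclasses import dataclass, field

@dataclass
class DIComponent:
    category: str      # extract | transform | cleanse | load
    subtype: str       # e.g. "conform", "technical_dq_check"
    description: str

@dataclass
class LogicalDIModel:
    name: str
    source: str
    target: str
    components: list = field(default_factory=list)

    def to_physical_spec(self) -> str:
        """Render the design specification an ETL developer works from."""
        lines = [f"DI model: {self.name} ({self.source} -> {self.target})"]
        for i, c in enumerate(self.components, 1):
            lines.append(f"  {i}. [{c.category}/{c.subtype}] {c.description}")
        return "\n".join(lines)

# Hypothetical example: loading a customer dimension from a CRM source.
model = LogicalDIModel("load_customer_dim", "CRM", "EDW")
model.components.append(DIComponent("extract", "shadow_table",
                                    "pull changed customer rows"))
model.components.append(DIComponent("cleanse", "technical_dq_check",
                                    "enforce types and key integrity"))
model.components.append(DIComponent("transform", "conform",
                                    "map CRM codes to conformed dimensions"))
model.components.append(DIComponent("load", "bulk_load",
                                    "flat-file load into customer_dim"))
```

Because the logical model names each step by framework category and subtype, a future developer can read the physical code against this spec and locate the logic to debug quickly.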
Step Three: Patterns Over Time Evolve the DI Architecture
The DI architecture will evolve over time with the BI program, and the DI framework may be extended but not necessarily changed, because the architecture is the result of decisions made as you move from logical to implemented physical DI models. A key responsibility of lead architects or developers is to observe how the overall data warehouse architecture is evolving and to recognize patterns of repeatability and unique extensions for reusability.
Observing patterns across DI projects will help determine whether you should convert to a metadata-driven engine to increase code quality through high reusability. Integration patterns emerge in all data warehouse architectures over time. Data architectures become the source and target for integration building blocks that repeat between layers, such as staging, the operational data store, the enterprise normalized or dimensional warehouse, and information delivery or analytic data marts and sandboxes. Some companies treat refactoring their integration architecture as the blueprint for agile data integration.
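A metadata-driven engine of this kind can be illustrated in miniature: repeated integration patterns are written once as reusable steps, and each feed is then described by metadata rather than hand-coded. The step names, metadata layout, and sample rows below are assumptions for the illustration, not a specific product's design:

```python
# Illustrative metadata-driven DI engine: reusable steps are registered
# once, then a pipeline is driven entirely by a metadata list.

def extract(rows, **kw):
    # Stand-in for a real extraction component; here it just copies rows.
    return list(rows)

def filter_nulls(rows, key, **kw):
    # Technical data quality check: drop rows missing a required key.
    return [r for r in rows if r.get(key) is not None]

def conform(rows, mapping, key, **kw):
    # Conforming transformation: map source codes to target codes.
    return [{**r, key: mapping.get(r[key], r[key])} for r in rows]

STEPS = {"extract": extract, "filter_nulls": filter_nulls, "conform": conform}

def run_pipeline(rows, metadata):
    """Execute the steps named in the metadata against the row set."""
    for step in metadata:
        rows = STEPS[step["op"]](rows, **step.get("args", {}))
    return rows

# Adding a new feed means adding metadata entries, not writing new code.
metadata = [
    {"op": "extract"},
    {"op": "filter_nulls", "args": {"key": "cust_id"}},
    {"op": "conform", "args": {"key": "country",
                               "mapping": {"USA": "US", "U.S.": "US"}}},
]
source = [{"cust_id": 1, "country": "USA"},
          {"cust_id": None, "country": "US"},
          {"cust_id": 2, "country": "U.S."}]
result = run_pipeline(source, metadata)
```

The reuse payoff is that a quality fix to one step, such as `filter_nulls`, immediately benefits every pipeline whose metadata names it.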
Thought leadership and white papers laid the groundwork for database design and modeling tools. DI solution vendors, meanwhile, focused on building tools that move developers from hand-coded routines into environments with efficient visual user interfaces and a palette of pre-built components. Mastering the ability to create a strategic data integration framework and evolve it over time is a significant accomplishment. Implementing the framework and methodology across many BI projects leads to a proficiency that cannot be learned any other way, as it requires a thorough understanding of business BI capabilities, requirements, and data architectures.
Data integration competency centers or centers of excellence (COEs) are a great concept, but the reality is that most of us haven’t yet established a formal DI COE. Keep in mind that this data integration framework and methodology is just as applicable to small enterprises, large enterprises, and agile BI teams alike.