Introduction to Next Generation Data Integration
By Philip Russom
Data integration (DI) has undergone an impressive evolution in recent years. Today, DI is a rich set of powerful techniques, including ETL (extract, transform, and load), data federation, replication, synchronization, changed data capture, data quality, master data management, natural language processing, business-to-business data exchange, and more. Furthermore, vendor products for DI have achieved maturity, users have grown their DI teams to epic proportions, competency centers regularly staff DI work, new best practices continue to arise (such as collaborative DI and agile DI), and DI as a discipline has earned its autonomy from related practices such as data warehousing and database administration.
To help user organizations understand and embrace all that next generation data integration (NGDI) now offers, this report catalogs and prioritizes the many new options for DI. The report redefines data integration, showing that its newest generation is an amalgam of old and new techniques, best practices, organizational approaches, and home-grown or vendor-built functionality. It brings readers up to date by discussing relatively recent (and ongoing) evolutions of DI that make it more agile, architected, collaborative, operational, real-time, and scalable. It points to new platforms for DI tools (open source, cloud, SaaS, and unified data management) and DI’s growing coordination with related best practices in data management (especially data quality, metadata and master data management, data integration acceleration, data governance, and stewardship). The report also quantifies trends among DI users who are moving into a new generation, and it provides an overview of representative vendors’ DI tools.
The goal is to help users make informed decisions about which combinations of DI options match their business and technology requirements for the next generation. But the report also raises the bar on DI, under the assumption that a truly sophisticated and powerful DI solution will leverage DI’s modern best practices using up-to-date tools.
Ten Rules for Next Generation Data Integration
Data integration has evolved and grown so fast and furiously in the last 10 years that it has transcended ancient definitions. Getting a grip on a modern definition of DI is difficult, because “data integration” has become an umbrella term and a broad concept that encompasses many things. To help you get that grip, the 10 rules for next generation data integration listed below provide an inventory of techniques, team structures, tool types, methods, mindsets, and other DI solution characteristics that are desirable for a fully modern next generation DI solution. Note that the list is a summary that helps you see the newfound immensity of DI; the rest of the report drills into the details of these rules.
Admittedly, the list of 10 rules is daunting because it’s thorough. Few organizations will need or want to embrace all of them; you should pick and choose according to your organization’s requirements and goals. Even so, the list both defines the new generation of data integration and sets the bar high for those pursuing it.¹
- DI is a family of techniques. Some data management professionals still think of DI as merely ETL tools for data warehousing or data replication utilities for database administration. Those use cases are still prominent, as we’ll see when we discuss TDWI survey data. Yet, DI practices and tools have broadened into a dozen or more techniques and use cases.
- DI techniques may be hand coded, based on a vendor’s tool, or both. TDWI survey data shows that migrating from hand coding to using a vendor DI tool is one of the strongest trends as organizations move into the next generation. A common best practice is to use a DI tool for most solutions, but augment it with hand coding for functions missing from the tool.
- DI practices reach across both analytics and operations. DI is not just for data warehousing (DW), nor is it just for operational database administration (DBA). It now has many use cases spanning analytic and operational contexts, and expanding beyond DW and DBA work is one of the most prominent generational changes for DI.
- DI is an autonomous discipline. Nowadays, there’s so much DI work to be done that DI teams with 13 or more specialists are the norm; some teams have more than 100! The diversity of DI work has broadened, too. Due to this growth, a prominent generational decision is whether to staff and fund DI as is, or to set up an independent team or competency center for DI.
- DI is absorbing other data management disciplines. The obvious example is DI and data quality (DQ), which many users staff with one team and implement on one unified vendor platform. A generational decision is whether the same team and platform should also support master data management, replication, data sync, event processing, and data federation.
- DI has become broadly collaborative. The larger number of DI specialists requires local collaboration among DI team members, as well as global collaboration with other data management disciplines, including those mentioned in the previous rule, plus teams for message/service buses, database administration, and operational applications.
- DI needs diverse development methodologies. A number of pressures are driving generational changes in DI development strategies, including increased team size, operational versus analytic DI projects, greater interoperability with other data management technologies, and the need to produce solutions in a more lean and agile manner.
- DI requires a wide range of interfaces. That’s because DI must access a wide range of source and target IT systems at a variety of information delivery speeds and frequencies. This includes traditional interfaces (native database connectors, ODBC, JDBC, FTP, APIs, bulk loaders) and newer ones (Web services, SOA, and data services). The newer interfaces are critical to next generation requirements for real time and services. Furthermore, as many organizations extend their DI infrastructure, DI interfaces need to access data on premises, in public and private clouds, and at partner and customer sites.
- DI must scale. Architectures designed by users and servers built by vendors must scale up and scale out to handle both burgeoning data volumes and increasingly complex processing, while still delivering high performance at scale. With volume and complexity exploding, scalability is a critical success factor for future generations; make it a top priority in your plans.
- DI requires architecture. It’s true that some DI tools impose an architecture (usually hub and spoke), but DI developers still need to take control and design the details. DI architecture is important because it strongly enables or inhibits other next generation requirements for scalability, real time, high availability, server interoperability, and data services.
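Several of the rules above are concrete enough to sketch in code. The interface rule, for instance, notes that traditional connectors (native drivers, ODBC, JDBC) expose a common access pattern. The minimal extract-transform-load sketch below illustrates that pattern using Python’s DB-API, with sqlite3 standing in for the source and target systems; the table and column names are hypothetical, not from the report:

```python
import sqlite3

def extract(conn, query):
    # Any DB-API connection works here; ODBC and JDBC bridges expose the same shape.
    return conn.execute(query).fetchall()

def transform(rows):
    # Stand-in for real cleansing logic: trim whitespace and upper-case names.
    return [(rid, name.strip().upper()) for rid, name in rows]

def load(conn, rows):
    conn.executemany("INSERT INTO dim_customer (id, name) VALUES (?, ?)", rows)
    conn.commit()

# sqlite3 stands in for both the source and the target system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customer VALUES (?, ?)", [(1, " ada "), (2, "grace")])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT)")

load(tgt, transform(extract(src, "SELECT id, name FROM customer")))
print(tgt.execute("SELECT id, name FROM dim_customer").fetchall())
# [(1, 'ADA'), (2, 'GRACE')]
```

The point of the sketch is the separation of concerns: because extract, transform, and load touch each other only through plain row lists, swapping the sqlite3 connection for an ODBC or JDBC-backed one changes no logic, which is why the same solution can serve many source and target systems.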
Why Care About NGDI Now?
Businesses face change more often than ever before. Recent history has seen businesses repeatedly adjusting to boom-and-bust economies, a recession, financial crises, shifts in global dynamics or competitive pressures, and a slow economic recovery. DI supports real-world applications and business goals, which are affected by economic issues. Periodically, you need to adjust DI solutions to align with technical and business goals for data.
The next generation is an opportunity to fix the failings of prior generations. For example, most older DI solutions lack a recognizable architecture, whereas achieving next generation requirements (especially real time, data services, and high availability) demands a modern one. Older ETL solutions, in particular, were designed for serial processing; they need to be redesigned for parallel processing to meet next generation performance requirements for massive data volumes.
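The serial-to-parallel redesign mentioned above usually means partitioning the data and transforming partitions concurrently. A minimal sketch, assuming a simple per-row transform and using Python’s thread pool for brevity (a real redesign would rely on a process pool or the DI engine’s own parallel runtime):

```python
from concurrent.futures import ThreadPoolExecutor

def transform_partition(rows):
    # Work applied independently to each partition; no cross-partition state.
    return [r * 2 for r in rows]

def partition(data, n):
    # Round-robin split into n roughly equal partitions.
    return [data[i::n] for i in range(n)]

def run_parallel(data, n_workers=4):
    parts = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(transform_partition, parts)
    # Merge results; order follows partitions, not the original row order.
    return [row for part in results for row in part]

print(sorted(run_parallel(list(range(10)))))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The key design constraint is visible in `transform_partition`: only transforms with no cross-partition dependencies can be parallelized this naively, which is one reason retrofitting parallelism into an old serial ETL job often requires rearchitecting rather than tuning.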
Some DI solutions are in serious need of improvement or replacement. For example, most DI solutions for business-to-business (B2B) data exchange are legacies, based on low-end techniques such as hand coding, flat files, and file transfer protocol (FTP). These demand a serious makeover—or rip and replace—if they’re to bring modern DI techniques into B2B data exchange. Similar makeovers are needed with older data warehouses, customer data hubs, and data sync solutions.
Even mature DI solutions have room to grow. Successful DI solutions mature through multiple lifecycle stages. In many cases, NGDI focuses on the next phase of a carefully planned evolution.
For many, the next generation is about tapping more functions of DI tools they already have. For example, most DI platforms have supported data federation for a few years now, yet only 30% of users have tapped this capability. Also to be tapped are newer capabilities for real time, micro-batch processing, changed data capture (CDC), messaging, and complex event processing (CEP).
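To make one of those untapped capabilities concrete: the simplest form of changed data capture is polling against a timestamp high-water mark, where each micro-batch picks up only rows modified since the previous run. (Vendor DI tools typically implement log-based CDC instead, which is more robust.) A sketch of the polling pattern, using hypothetical row dictionaries:

```python
def capture_changes(rows, last_seen):
    # One micro-batch: rows modified after the previous high-water mark.
    batch = [r for r in rows if r["updated_at"] > last_seen]
    # Advance the mark so the next poll skips rows already captured.
    new_mark = max((r["updated_at"] for r in batch), default=last_seen)
    return batch, new_mark

table = [
    {"id": 1, "name": "ada",   "updated_at": 100},
    {"id": 2, "name": "grace", "updated_at": 205},
]
batch, mark = capture_changes(table, last_seen=150)
print([r["id"] for r in batch], mark)
# [2] 205
```

Note the strict comparison against `last_seen`: rows that share the exact mark timestamp across two polls are a known edge case of timestamp-based CDC, and avoiding such gaps is one reason log-based CDC is preferred where the source system supports it.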
Unstructured data is still an unexplored frontier for most DI solutions. Many vendor DI platforms now support text analytics, text mining, and other forms of natural language processing. Handling non-structured and complex data types is a desirable generational milestone in text-laden industries such as insurance, healthcare, and federal government.
DI is on its way to becoming IT infrastructure. For most organizations, this is a few generations away. But you need to think ahead to the day when data integration infrastructure is open and accessible to most of the enterprise the way that local area networks are today. Evolving DI into a shared infrastructure fosters business integration via shared data.
DI is a growing and evolving practice. More organizations are doing more DI, yet staffing hasn’t kept pace with the growth. And DI is becoming more autonomous every day. You may need to rethink the headcount, skill sets, funding, management, ownership, and structure of DI teams.
¹ For a similar list with more details, see the TDWI Checklist Report: Top Ten Best Practices for Data Integration, available on tdwi.org.
Philip Russom is a research director at The Data Warehousing Institute (TDWI), where he oversees many of TDWI’s research-oriented publications, services, and events. Prior to joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He has also run his own business as a BI consultant and independent analyst, plus served as a contributing editor to leading data management magazines. You can reach him at [email protected].
This article was excerpted from the full, 32-page report, Next Generation Data Integration. You can download this and other TDWI Research free at tdwi.org/bpreports.
The report was sponsored by DataFlux, IBM, Informatica, SAP, Syncsort, and Talend.