Top Tips for Effective Data Integration
A new report breaks the hard problem of data integration down into an easily digestible checklist of ten recommended DI practices.
- By Stephen Swoyer
- June 16, 2010
A new report from Philip Russom, senior manager for TDWI Research, breaks the hard problem of data integration (DI) down into an easily digestible checklist of ten recommended DI practices.
At the top of Russom's list: the importance of recognizing (and understanding) DI diversity. It might seem obvious, but -- in an industry where ETL is still used interchangeably with DI and deduplication is synonymous with data quality (DQ) -- it doesn't get said often enough.
For this reason, Russom stresses, there's no silver bullet for "perfect" or "ideal" DI. Data integration is, instead, a diverse discipline that makes use of a number of tools depending on one's business and technological requirements.
By the same token, no single technology can claim to encapsulate DI. For a long time, extraction, transformation, and loading (ETL) circumscribed both the practice and the limits of DI. (This confusion persisted even though some of the biggest ETL players -- IBM Corp., Informatica Corp., Oracle Corp., and SAS Institute Inc. -- started marketing full-blown DI suites more than half a decade ago.)
Enterprisewide DI necessarily involves different domains, however; a data warehouse administrator's view of a prospective DI problem tends to look very different from a database administrator's.
"Upon hearing the term data integration, people may think only of the technique they encounter most," Russom writes, invoking the example of ETL as a case in point. "[D]atabase administrators may think of replication and data synchronization [as primary DI tools], which are common in their work. And DI specialists who perform business-to-business data exchange may think of flat files communicated over file transfer protocol or electronic data interchange. In truth, all these techniques -- and others -- fit under the broad DI umbrella."
Shades of Gray
It used to be that DI had discrete analytic and operational aspects. Things are less clear-cut today. DI not only spans both analytics and operations but also includes a hybrid category -- think of the overlap or intersection in a Venn diagram -- that Russom dubs Hybrid Data Integration.
Analytic DI, for example, deals with traditional data warehouse-driven activities; operational DI describes the practice of connecting to and integrating data from operational applications and databases. (This was traditionally the province of enterprise application integration, or EAI, solutions.) Hybrid DI, finally, involves interstitial practices -- such as master data management (MDM) -- that belong to both domains. Shops need to be mindful of the distinctions, says Russom.
In other words, there are no longer specific tools (ETL for analytic DI, EAI for operational DI) for specific practices. "Across these practices, any DI technique and tool type may be used and all practices assume core skills for databases, data models, interfaces, and transformations," he points out.
In Praise of Autonomous DI
Elsewhere, Russom stresses, shops need to treat data integration as an autonomous discipline. It's no longer appropriate to think of DI as a subset of data warehousing, for example; organizations are increasingly obtaining excellent results by reorganizing their DI practices as distinct entities.
"DI can still be practiced successfully when subsumed by a larger team. However, some organizations are moving toward independent teams of DI specialists who perform a wide range of DI work, whether analytic, operational, or hybridized," Russom explains. Similarly, he describes the Data Integration Competency Center -- which some vendors (such as Informatica Corp.) have been talking up for more than a decade -- as the "epitome" of autonomous data integration.
"Hundreds have sprung up in the last decade as shared-services organizations for staffing all DI work -- not just [Analytic DI] for DW," Russom writes, adding "this is a time of great change for DI, and now's the time to plan for DI's future."
T is for Transform
Although shops need to be careful not to confuse DI with specific DI technologies (such as ETL), they likewise need to avoid another trap: integrating the status quo. "True DI," Russom writes, "is about transforming data."
In a lot of cases, he argues, data needs to be transformed so that it can be effectively repurposed. (Replication or synchronization among databases is an obvious exception to this claim, Russom concedes.)
"It's not just the technicalities of transforming data from one data model and shoehorning it into another. Equally important is the fact that the source and target IT systems involved in this kind of data transfer serve different business purposes," he writes. "[T]ransforming data is a technical task that supports a business goal -- namely, repurposing data for a business use that differs from the one for which the data originated."
DataFlux president Tony Fisher made a similar point during a sit-down interview at February's TDWI Winter World Conference in Las Vegas. DataFlux had just unveiled its inaugural Data Integration suite, an offering that incorporates ETL and other DI technologies from its parent company, SAS. (SAS, for the record, sponsored Russom's report.)
Many shops think of DI as an onerous requirement -- as, in fact, a "cost center," Fisher conceded. "The reality is it's an opportunity. I know that gets said a lot -- that a requirement or disruption is actually an 'opportunity' -- but in this case, it's true," he argued. "[Data integration] gives you an opportunity to correct [data], to cleanse [data], to reconcile [data], to expose data that's previously been siloed [in applications or sources]. It's a chance to build in [business] rules that are applicable across all your environments. It is absolutely an opportunity."
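Fisher's list -- correct, cleanse, reconcile -- is easy to sketch in code. The Python fragment below shows hypothetical cleansing rules of the sort he describes, applied identically to records from two different silos; the rules and field names are illustrative, not DataFlux functionality.

```python
# Minimal sketch of reusable cleansing rules applied across source systems.
# The rules and field names are hypothetical illustrations.
import re

def standardize_phone(raw):
    """Reduce a U.S. phone number to ten digits, dropping a leading country code."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else None  # None flags a DQ failure

def cleanse(record):
    """Apply one rule set to a record from any source system."""
    return {
        "name": record.get("name", "").strip().title(),
        "phone": standardize_phone(record.get("phone")),
    }

crm_row = {"name": "  jane DOE ", "phone": "(702) 555-0142"}
erp_row = {"name": "JANE DOE", "phone": "1-702-555-0142"}
assert cleanse(crm_row) == cleanse(erp_row)  # two silos now reconcile to one record
```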
The Greening of DI
People don't typically think of DI as an eco-friendly or eco-fostering practice. To the extent that DI is a vital component of any system, application, or database consolidation effort, however, it has an indisputable green aspect.
"Although a few application and database consolidations may be executed by simply copying data from source A to target B, the vast majority require that data be transformed and cleansed to better fit the target," Russom writes.
Russom also identifies several scenarios in which DI might actually be preferable to server virtualization. "Virtualization is most often applied to application servers, which can be collocated and configured in a straightforward manner to share common memory space and other hardware resources," he points out. "Data servers are a different matter entirely, because all are designed to seize every scrap of hardware resource. For this reason, the virtualization of multiple data servers is unlikely to yield desirable results." Thus, he concludes, "database consolidations and collocations -- which are best accomplished with DI techniques -- are preferred over true virtualization."