FAQ: Next Generation Data Integration
A few days ago, I presented a TDWI Webinar based on my newly published TDWI Best Practices report about “Next Generation Data Integration” (NGDI). Almost three hundred people attended the broadcast, and (with such a large turnout) I got a ton of great questions from the audience about data integration (DI).
I’d like to share some of those questions with you (and my responses to Webinar attendees who asked them), as a way of expanding and clarifying the research findings of the report. If you care about DI, this should be interesting for you.
Concerning bulk upload, should we use a batch upload mechanism or Web services?
It depends on the dataset being bulk loaded. You should stick to your old reliable bulk loader for datasets that are very large, too large for a service bus, don’t have an immediate delivery requirement, or demand multiple complex passes (as many multidimensional structures do, when being loaded into a data warehouse). Most services, messages, or events used in a DI context handle time-sensitive data, which is delivered faster over a message or service bus. Also, real-time DI often enables Operational Business Intelligence (OpBI), where data is drawn frequently from ERP, CRM, and other operational applications, then loaded into a warehouse, mart, or other BI data store. OpBI may also use DI to publish improved data back to those applications. Many operational applications (especially SAP) are best extracted from via the application layer, and services and messages usually support such an interface. From these examples, you can see that the old (bulk loaders) and the new (services) intermingle in the newest DI generation.
Do staging tables play an important role in DI?
Yes. The newest generation of DI still relies of older, tried-and-true designs and DI architectures. And these typically have a variety of data landing and data staging areas, including databases (like operational data stores) and tables (whether physically in the data warehouse or external to it). One new spin on this is that 64-bit computing and very large memory spaces in server hardware now enable more effective DI pipes. This is where data is staged and processed in server memory, not landed to disk. This both speeds up DI transformational processing and boosts scalability for large data volumes. For many organizations, NGDI is about adjusting (not abandoning) useful best practices like this to take advantage of newly available platform capabilities.
Is DI architecture and information architecture the same thing?
No, they’re different. Information architecture is usually about the data models and schema within individual enterprise databases, plus data dependencies across multiple ones. DI architecture concerns the design of data flows, plus development standards (like preferred interfaces for specific applications). For DI, hub-and-spoke is the most common architecture, where a vendor’s DI tool or a control server (in home-grown DI solutions) is equivalent to a hub. But point-to-point interfaces still abound in DI jobs, and DI over a bus is subject to whatever the bus requires. My report explains that designing and using just the right DI architecture has become a critical success factor for satisfying next-generation requirements, like scalability, real time, governance, and DI team collaboration.
Where do you see ERP choices within the context of NGDI?
In my world, Operational Business Intelligence (OpBI) has become quite common. OpBI requires much from a DI tool. The DI tool has to support feature-rich interfaces to ERP and other application types. The DI tool must have optimization to draw data fast, frequently, and non-invasively from ERP modules and applications. And the DI tool must understand ERP data structures and function calls to make sense of ERP data, before integrating it elsewhere. OpBI and other real-time business practices wouldn’t be possible without real-time DI. In fact, my report shows that various real-time DI functions are the ones users will increase the use of most over the next three years.
Other common DI practices involving ERP include synchronizing customer data (and other data domains, especially product data) across multiple ERP modules and instances. Synchronizing reference data is a similar practice, one that’s growing quickly. Since some ERPs are almost impermeable, DI is regularly called in to assist with data access for data quality. This kind of coordination between DI and DQ is one of the hallmarks of NGDI.
Do you think certain aspects of traditional EAI are going to be part of NGDI?
Well, first of all, I regularly find some DI functions executed over EAI and similar buses in user organizations that have already made a substantial investment in a robust EAI infrastructure. Firms in financial and insurance industries are typical examples. Second, I think what’s happening in such firms is that DI is simply leveraging more deeply an existing infrastructure, just as other users, applications, and tools are. Third, DI is being driven to EAI, in situations where EAI has better interfaces (especially to packaged applications) or certain time-sensitive data has a real-time requirement (for which EAI messages are easily configured). Even so, there’s still a need for standard data interfaces over the enterprise LAN.
Any metrics around how much operational cost is associated with near real-time data integration vs the traditional batch model?
Ten years ago, real-time DI via EAI was possible, but it usually required the purchase of extra tools. Plus, real-time functions in tools and applications weren’t very robust, so an administrator had to watch and tweak them constantly. These two characteristics drove up the cost. Luckily, a lot of RT functionality is built into today’s applications, databases, and DI tools. Many firms have a robust EAI or service bus infrastructure that DI can tap for real time. For firms that have kept their enterprise software and infrastructure up-to-date, real time DI is quite accessible, reliable, and inexpensive, as compare to the recent past. But that’s with EAI in mind. From a different direction, batch processing has improved, too. It may be preferred in the form of so-called micro-batches for frequent intra-day extract that needn’t be truly RT.
Can you expand on RT event processing, including contexts for applicability?
You probably don’t want to handle just any kind of event via a DI tool. Instead, some kind of “complex event” benefits from DI processing. A complex event is actually multiple events, typically occurring at different times (even different months or years) that need to be correlated. ETL-ish DI can access the many diverse data sources and data models where complex data events may be managed. Today, I almost exclusively find federal intelligence or security agencies doing this, to recognize and quantify security threats. The TSA and Coast Guard come to mind. But it’s just a matter of time before such DI-enabled practices are common with customer events in for-profit corporations.
If you have a question or answer about Next Generation Data Integration (or a reaction to one presented above), please share them by responding to this blog.
Register for and replay the TDWI Webinar these questions came from at
Download a free copy of the TDWI Best Practices Report titled Next Generation Data Integration, at http://tdwi.org/research/list/tdwi-best-practices-reports.aspx
Find tweets about NGDI by searching Twitter.com for the hash tag #NGDI.
Posted by Philip Russom, Ph.D. on April 19, 2011