August 4, 2011
Frequently Asked Questions about Agile Data Integration
Topic: Data Integration Agility
Over the last couple of years, many BI professionals have adjusted their development practices to make them more agile in response to business demands for faster BI development. We're now experiencing a similar interest from data management professionals, who are looking into how they can make their work more agile. Here are some questions I've heard asked repeatedly about agile methods as they relate to data integration (DI), along with my answers.
What are the agile issues around data integration?
First, business intelligence (BI) and data warehousing (DW) include many different tasks and skills. Some of these adapt to agile methods more easily than others. For example, report design and creation is one of the most agile, whereas data integration is one of the least agile tasks.
Second, the agility of one skill can be dragged down by the lack of agility in another. For example, report design often flies along until the design needs data that's not already in a data warehouse. The agile process for report design may come to a halt while report designers -- and the business people who need the report -- wait for a data integration specialist to develop technical routines that can acquire and prepare new data.
It takes time to find the right data, develop ETL and other routines to get the data, then prepare the data for reports by repairing its quality and metadata. How can we speed up all these data integration tasks?
A method that's working in many user organizations is to assign a special user -- usually someone trained as a business analyst or data steward -- to be a data integration designer (but not a developer). These hybrid people understand both enterprise data and business people's uses of it. An analyst or steward can make quick and accurate decisions about what data is needed for a specific report or analysis. In some cases, they can actually do the data integration work; usually, however, they specify the needed design, then pass their requirements to a much more technical data integration specialist who does the heavy lifting of development for extract, transform, and load (ETL).
Is it unusual for business analysts and data stewards to specify data integration designs?
Yes, it is a bit of a change. Data stewards have long been liaisons between business and technical people, quickly identifying and prioritizing data quality work. Many firms are now asking stewards to fulfill a similar role in data integration. Likewise, business analysts have become geniuses at throwing together analytic datasets in reaction to a sudden change in business operations. Now they're being asked to apply this agile skill to mainstream BI work.
What can vendor tools do to make data integration more agile?
A huge chunk of a data integration developer's time is burned up mapping source data elements to specific fields in target databases. Tools from several vendors have become considerably smarter about guessing which source elements should map to which targets. Even if the tool guesses correctly only half the time, that's still a big lift in developer productivity. Similar automation for associating table keys is now available.
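To make the idea concrete, here is a minimal sketch of the name-matching heuristic such tools use to propose source-to-target mappings. This is an illustration only, not any vendor's actual algorithm; the column names are hypothetical, and real tools also weigh data types, profiles, and lineage metadata, not just name similarity.

```python
from difflib import get_close_matches

def suggest_mappings(source_cols, target_cols, cutoff=0.6):
    """Propose source-to-target column mappings by name similarity.

    A toy version of the "smart guessing" described above. Names are
    normalized (lowercased, separators stripped) before fuzzy matching.
    """
    def norm(name):
        return name.lower().replace("_", "").replace("-", "")

    candidates = {norm(t): t for t in target_cols}
    mappings = {}
    for src in source_cols:
        match = get_close_matches(norm(src), candidates.keys(), n=1, cutoff=cutoff)
        if match:
            mappings[src] = candidates[match[0]]
    return mappings

# Hypothetical source and target schemas for illustration.
source = ["CUST_ID", "cust_name", "PhoneNum", "region_cd"]
target = ["customer_id", "customer_name", "phone_number", "region_code"]
print(suggest_mappings(source, target))
```

Even a crude heuristic like this resolves most of the obvious mappings, leaving the developer to review the suggestions and handle the genuinely ambiguous cases by hand.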
The automatic generation of test data for DI is now faster and better. Some tools can look at data profiles you've created and make useful recommendations for data mappings, quality improvements, and business rules for verifying data. Another time sink is the tweaking of table joins; some tools can now determine a join plan that will perform better than most manual optimizations. All of these productivity gains accelerate DI work.
In the applications world, everyone talks about "reuse" as the ultimate productivity booster. Does that apply to data integration?
Indeed it does. The catch is that reuse demands that a developer spend more time up front documenting and generalizing his work, but that time is made up in spades; it's easier for developers to reuse objects created by other developers, objects such as data profiles, business rules, ETL routines, and interfaces.
If we go way out on the leading edge, data services are the ultimate in reuse. This is where data integration functionality is built as a generalized service that many different DI solutions can just plug in and use. A single data service typically supports many interfaces, so it can plug into many source and target systems. The really good data services will automatically behave differently in different contexts.
For example, you sometimes need the same data extraction and processing logic to run in overnight batch, micro batch, on demand, or in real time. A good service can operate in multiple "modes" like this without a developer rewriting it. A truly magical service will automatically sense which behavior is needed in a given context and switch to an appropriate mode. Again, it takes more work up front to build a multi-modal data service, but the payoff in reuse and agility is worth it.
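The shape of such a multi-modal service can be sketched as follows. This is a simplified illustration under assumed names (`DataService`, the `clean` transform, and the record layout are all hypothetical): one piece of shared DI logic, exposed through batch, micro-batch, and on-demand entry points, so the logic is written once and reused in every mode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

@dataclass
class DataService:
    """Sketch of a multi-modal data service: one transformation,
    several run modes. Real services would add scheduling, interfaces
    to many source/target systems, and context-aware mode selection."""
    transform: Callable[[Dict], Dict]  # the shared DI logic

    def run_batch(self, records: Iterable[Dict]) -> List[Dict]:
        # Overnight batch: process the full extract in one pass.
        return [self.transform(r) for r in records]

    def run_micro_batch(self, records: Iterable[Dict], size: int) -> List[List[Dict]]:
        # Micro batch: process small windows of records at a time.
        window, out = [], []
        for r in records:
            window.append(self.transform(r))
            if len(window) == size:
                out.append(window)
                window = []
        if window:
            out.append(window)
        return out

    def run_on_demand(self, record: Dict) -> Dict:
        # On demand / real time: handle a single request as it arrives.
        return self.transform(record)

# Shared logic: standardize a customer name (illustrative only).
def clean(record: Dict) -> Dict:
    return {**record, "name": record["name"].strip().title()}

svc = DataService(transform=clean)
```

A caller picks the mode that fits its context; the "truly magical" service described above would go further and choose the mode itself, based on signals such as request latency requirements or data volume.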
Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI). Philip can be reached at firstname.lastname@example.org .
Copyright 2011. TDWI. All rights reserved.