Q&A: Tackling Today's Data Integration Issues
How to get to the root of ETL and data integration issues.
- By Linda L. Briggs
- May 11, 2011
As companies perform more complex transformations on greater amounts of data, performance suffers. To address speed issues and avoid data integration bottlenecks, companies have historically tried adding hardware, shifting the transform process to the database, or hand-coding solutions. There's a better way to address ETL and data integration issues, as Syncsort's Jorge A. Lopez explains here.
"Organizations need to accelerate data integration by surgically removing specific bottlenecks," Lopez says. He advocates targeting the top 20 percent of processes in terms of elapsed time and complexity, since those jobs cause 80 percent of the performance problems. The right tool is also critical, Lopez explains: "Specifically, [you need] a data integration tool designed for high performance and scalability that delivers optimal performance and is self-tuning."
As senior manager of product marketing for data integration at Syncsort, Lopez has over 12 years of experience in BI and data integration. Prior to joining Syncsort, Lopez was a senior product manager at MicroStrategy, where he oversaw the technical direction of key business intelligence products and led performance and scalability initiatives. He recently spoke at a TDWI Webinar, "ETL Performance: A Checklist for Accelerating Data Integration."
BI This Week: What sorts of extreme data integration problems do companies face today? Why have these problems worsened?
Jorge A. Lopez: We're witnessing an unprecedented creation of, and demand for, information. More than ever, data is permeating every aspect of our lives, from the way we do business to the individual choices that we make, such as picking a specific restaurant or consumer product.
The exponential adoption of mobile technologies, combined with social media, will continue to push demand for data even higher. This brings an enormous challenge to organizations that must transform terabytes of data into relevant, consumable information that is vital to grow, compete, and survive.
For instance, a major investment bank must deliver hundreds of thousands of reports containing intra-day trading activity for regulatory compliance. Failure to deliver these reports to its customers on a daily basis can result in severe revenue loss as well as legal implications. Similarly, businesses in all industries and markets are increasingly facing data performance bottlenecks in areas such as business intelligence, data warehousing, database marketing, and clickstream analysis, to name a few.
Where in the architecture do bottlenecks tend to occur, and why?
Unfortunately, in most cases, performance bottlenecks tend to occur within data integration tools. The three main components of any data integration architecture are the data sources, the hardware, and the data integration tools. Database and hardware vendors have invested heavily in optimizing and accelerating technologies to cope with this tremendous demand for speed and larger data volumes. Data integration vendors, however, have historically focused on delivering an expanded set of functionality. This adds considerable baggage, makes their stacks heavier, and ultimately creates a clear disconnect between the performance challenges businesses are facing and the capabilities that the data integration tools provide.
How can companies best identify where ETL and DI bottlenecks are occurring in existing data loads?
In the best scenarios, the cases are clear-cut. For example, we recently worked with a company, a provider of information solutions, that needed to join five one-terabyte tables in less than 24 hours. This process was critical to successfully launch new products and remain competitive. For this organization, the problem became crystal clear when its existing data integration tool required more than an entire week to perform the job.
Unfortunately, identifying performance bottlenecks is not always as obvious as one would think. This is because in many cases, performance issues manifest themselves in a variety of ways. The most common problems we've found include increasing database and hardware expenditures, longer development time for new reports, broken data lineage, and an overall data integration environment that becomes increasingly complex as well as harder to maintain and govern.
Does increasing hardware capacity tend to address the problem? What if performance increases don't follow?
When ETL tools don't scale, companies usually try to work around this problem by resorting to three common approaches: adding hardware, pushing transformations to the database, or hand-coding a solution. However, none of these approaches fully addresses the real problem: the lack of fast, scalable, and cost-efficient ETL tools. At best, these approaches shift the problem elsewhere. This creates secondary problems such as increased costs, longer time-to-value, and lack of governability, to name a few.
With processing power doubling approximately every two years, adding hardware has long been a favorite way to solve data performance bottlenecks for many IT organizations and even some data integration vendors. However, we're at a point where data growth is far outpacing Moore's Law. This means companies cannot simply throw more hardware at the problem to mask it.
Moreover, companies already dealing with terabytes of data a day can no longer afford to close their eyes to all the issues that come with increasing hardware demands: increased capital expenditures; increased cooling, maintenance, and administration costs; even a larger carbon footprint in an industry that is becoming more environmentally aware.
What about shifting the processing load to the database to speed up performance? Does that tend to work?
While databases can be really fast for performing certain types of operations, they can also add considerable overhead. Staging data, another approach that has become a "best practice," is a classic example. Staging is typically used to join tables by bringing data to a homogeneous area and then pushing the join to the database. The resulting table is then used to update the data warehouse. The reason this made sense was that traditional ETL tools were not very efficient at performing heterogeneous joins.
What many people failed to recognize was the enormous cost associated with staging. First, a table can take up to three times more space in the database than in a flat file. Then there is the additional overhead of constantly truncating tables and dropping and rebuilding indexes over and over. All this translates into more database licenses and greater maintenance costs. What if people had fast, efficient ETL tools that could do joins in memory at close to the speed of -- or even faster than -- indexed joins? For one provider of digital business analytics, this has meant tangible savings of hundreds of terabytes of primary storage a year.
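To make the in-memory alternative Lopez describes concrete, here is a minimal sketch of a hash join performed entirely in the ETL layer, with no staging tables or database round-trips. The table names, fields, and data are hypothetical, chosen only for illustration; a production ETL engine would of course stream and partition far larger inputs.

```python
def hash_join(left_rows, right_rows, key):
    """Join two lists of row dicts on `key` by building a hash index
    on one side -- the basic technique behind in-memory ETL joins,
    which avoids staging intermediate tables in the database."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical sample tables
orders = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 75}]
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]

result = hash_join(orders, customers, "cust_id")
# Each joined row carries fields from both inputs,
# e.g. {"cust_id": 1, "amount": 250, "name": "Acme"}
```

Because the join happens in the ETL tool's memory, the database never has to hold the staged copies or rebuild indexes over them, which is where the storage and maintenance savings come from.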
What are better solutions for fixing performance issues and accelerating data integration?
I firmly believe the solution is to directly attack the problem at its roots. In this case, it means organizations need to accelerate data integration by surgically removing specific bottlenecks. In most cases, these bottlenecks are processes that involve complex sorts, joins, and aggregations of large volumes of data. The best approach is to start by identifying and targeting the top 20 percent of these processes in terms of elapsed time and complexity. These are the same jobs that are causing 80 percent of the problems. Therefore, addressing them first can result in relatively quick and easy gains with huge benefits to the organization.
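The 80/20 triage Lopez recommends can be sketched in a few lines: rank jobs by elapsed time and pull out the top 20 percent as the first candidates for acceleration. The job names and runtimes below are made up for illustration; in practice these figures would come from scheduler or ETL tool logs.

```python
import math

# Hypothetical job runtimes (minutes) pulled from scheduler logs
job_runtimes = {
    "nightly_fact_load": 540,
    "customer_join": 480,
    "clickstream_agg": 80,
    "dim_refresh": 35,
    "lookup_sync": 20,
    "audit_extract": 15,
    "small_export_1": 10,
    "small_export_2": 10,
    "small_export_3": 5,
    "small_export_4": 5,
}

# Rank jobs by elapsed time, longest first
ranked = sorted(job_runtimes.items(), key=lambda kv: kv[1], reverse=True)

# Take the top 20 percent of jobs as acceleration candidates
top_n = max(1, math.ceil(len(ranked) * 0.20))
hotspots = ranked[:top_n]

# How much of total elapsed time do those few jobs consume?
total = sum(job_runtimes.values())
share = sum(t for _, t in hotspots) / total
print(f"Top {top_n} jobs consume {share:.0%} of total elapsed time")
```

With these sample numbers, two of ten jobs account for the large majority of total elapsed time, which is exactly the skew that makes targeting the top 20 percent pay off quickly.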
However, you need the right tool -- specifically, a data integration tool designed for high performance and scalability that delivers optimal performance and is self-tuning. As many organizations have learned the hard way, ongoing tuning is an expensive and time-consuming task. Finally, any solution to data performance problems needs to integrate seamlessly with existing data integration platforms. Most IT organizations have already made considerable investments in procuring, deploying, and maintaining their existing DI environments. A real solution should leverage those investments for greater return on investment.
What does Syncsort offer to address issues of data integration?
Syncsort DMExpress was specifically designed to accelerate data integration with software that is fast, efficient, easy to implement and maintain, and cost-effective. Customers around the world use DMExpress to remove data performance bottlenecks and the damaging business ramifications that result from them. Customers do this by "snapping" DMExpress into their existing data integration environments. Benefits to customers include up to 10 times faster elapsed times, up to 75 percent less CPU utilization (which allows them to defer hardware purchases), and five to 10 times lower storage costs. In addition, less database processing frees up more cycles for BI users and analytics and less tuning allows IT staff to spend more time on value-added initiatives, accelerating time-to-value from months to weeks or even days.
The tremendous amount of information being created, plus all the shortcomings of ETL and ELT, have created today's mess. DMExpress brings the "T" back to ETL to deliver on the original, unfulfilled promises of ETL from the 1990s. DMExpress allows IT organizations to take back control of their data integration environments and do ETL the way they originally planned: simple, fast, and cost-effective, with all transformations happening in one place.