LESSON - When Enterprise Extract Transform Load Slows Down … Speed It Up
- By Lia Szep, Senior Technical Analyst, Syncsort Incorporated
- October 18, 2007
Many of your BI applications rely on data gathered from your Web site. With 5 gigabytes of new Web logs a day, you’ve been tasked with developing an application for Web log processing. You’ve done all your homework, chosen the latest and greatest tools, and made your recommendation to management. The tools are bought, the development is done, and the project is in production; it seems you’ve made the right decision. Unfortunately, as hundreds of thousands of records turn into millions, and then billions, the system grows sluggish and reporting deadlines begin to pass unmet. Typically, the next phase of the project begins with the head-scratching question, “What was I thinking?” and ends with management’s directive to “make it work.” The last thing you want to do is start from scratch—so don’t.
The most likely reason for a processing meltdown is exponential data growth. The more successful a company becomes, the more data it generates. As companies continue to be bogged down by data growth, performance bottlenecks create pain points throughout the organization. A popular Web site can create billions of clickstream records in a single day. High-volume sales generate detailed financial records, more customer data, and additional marketing demographics.
Then there are legacy, inventory, and billing systems as well. This wealth of data often translates to longer processing times, increased costs, and missed reporting deadlines. When an enterprisewide extract, transform, load (ETL) solution begins to slow down, the impact is felt in everything from daily business transactions to high-level business analytics.
Don’t Go All Out
Code tweaks are not likely to speed your processing, and new hardware can be a costly Band-Aid that will eventually fall off. Even costlier would be a complete rip-and-replace. Instead, find a cost-effective solution that improves the performance of your current tool rather than replaces it—just make sure it fits the bill. In addition to finding a solution that works with your existing ETL, consider the following objectives.
Perhaps the most important objective will be to minimize the run time of an application. No matter what the other aspects of your system might be, unless you can minimize the time it takes to complete a job, you have no chance of meeting reporting deadlines. The key to minimizing run time is a solution that offers powerful aggregation and join tools.
Aggregates are the best way to speed warehouse queries. A query answered from base-level data can take hours and involve millions of data records and millions of calculations. With precalculated aggregates, the same query can be answered in seconds with just a few records and calculations.
High-performance aggregation simplifies the creation, administration, and execution of aggregation jobs. It summarizes data much faster than other aggregation methods such as C programs, SQL statements, or third-party, multipurpose data warehouse packages. It provides the flexibility necessary to select the best aggregates for optimizing query performance.
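The precomputed-aggregate idea can be sketched in a few lines of Python. This is a minimal illustration only; the record layout and field names are hypothetical, and a real aggregation engine would stream far larger volumes rather than hold records in memory.

```python
from collections import defaultdict

# Hypothetical base-level clickstream records: (date, page, visitor_id).
records = [
    ("2007-10-01", "/home", "v1"),
    ("2007-10-01", "/home", "v2"),
    ("2007-10-01", "/buy",  "v1"),
    ("2007-10-02", "/home", "v3"),
]

# Precompute the aggregate once: page views per (date, page).
# This single pass is the expensive step, paid up front.
aggregate = defaultdict(int)
for date, page, _visitor in records:
    aggregate[(date, page)] += 1

# A query answered from the aggregate touches one entry
# instead of rescanning every base-level record.
print(aggregate[("2007-10-01", "/home")])  # 2
```

The same trade-off holds at warehouse scale: the full scan happens once at load time, and each subsequent query is a cheap lookup against the summarized data.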
High-performance join significantly improves the efficiency of preprocessing, retrieval, and updating in a dimensional data warehouse. With high-performance join, you can:
- Optimize data preparation
- Improve query performance
- Reduce the quantity of data that must be processed in response to a query
- Speed lookups and application matching
- Retrieve and summarize information more efficiently
- Minimize storage and throughput requirements
- Reduce elapsed time of changed data capture/delta processing
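The lookup pattern behind several of these benefits can be illustrated with a simple hash join in Python. The table names and fields below are invented for illustration and do not reflect any particular product's API.

```python
# Hypothetical dimension (lookup) table, small enough to index in memory:
# product_id -> category.
dimension = {
    "p1": "phones",
    "p2": "plans",
}

# Fact records streaming in from the source system:
# (order_id, product_id, amount).
facts = [
    ("o1", "p1", 99.0),
    ("o2", "p2", 30.0),
    ("o3", "p1", 89.0),
]

# Hash join: one pass over the facts with constant-time lookups,
# instead of rescanning the dimension table for every record.
joined = [
    (order_id, dimension[product_id], amount)
    for order_id, product_id, amount in facts
]
print(joined[0])  # ('o1', 'phones', 99.0)
```

Because the dimension table is indexed once and each fact record is touched once, elapsed time grows linearly with the fact volume rather than with the product of the two table sizes.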
The solution must also be scalable. As companies go from gigabytes to terabytes and ultimately petabytes, scalability becomes essential to managing exponential data growth. Other things to consider are ease of use and vendor reputation. If a product is easy to use, it can help control the cost and time of training staff members. If the solution is from a reputable vendor, it is more likely to have been proven reliable.
Finally, whatever other options you may be considering, testing the product in your own environment, with your own data, is absolutely the only way to determine a product’s performance. Not only must the solution be the best—it must be the best for you. This is also the only way to truly tell if a solution will work with your current tools.
Our customers are often surprised to learn that, rather than replacing their existing ETL, they can reap more value from it simply by supplementing it with a high-performance solution. In one instance, a top telecommunications company needed to convert and migrate more than one terabyte of telephone numbers, phone inventory data, and activation records from its existing system to a new one. This data needed to be converted and normalized in the old system first, applying complex business rules, so that it could be used efficiently in the new environment, which consisted of an HP9000 800 Series with 8 CPUs.
After several dry runs of the more than 100 processes using Informatica and C++ to convert the data to the new application, the company estimated the project would take a week to complete. The time factor was of critical importance: during the conversion/migration process, all retail stores would have to be shut down—for seven days. This meant that every hour could cost the company millions of dollars in lost revenue.
Using the high-performance solution, the company was able to speed the sorting, summarization, joining for lookups, and conditional processing and transformation of the data. Five major input files were created, two for a phone number inventory scheme (CTN), and three for a phone inventory scheme (ESM). High-performance join was used for the lookups, after which the data was processed and transformed to meet the company’s requirements.
While the company still used Informatica, the high-speed solution replaced many of the slow-running processes, cutting the overall elapsed time by 88 percent. As a result, the project, which the company had anticipated would take one week to complete with just Informatica and C++, was finished in less than 20 hours. Without replacing its existing system, the company avoided the heavy revenue loss that an extended store closure would have caused.
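The reported figures hang together: cutting 88 percent from a week of elapsed time leaves roughly 20 hours, as a quick calculation shows.

```python
week_hours = 7 * 24              # original estimate: one full week of elapsed time
reduction = 0.88                 # reported cut in elapsed time
remaining = week_hours * (1 - reduction)
print(remaining)                 # roughly 20 hours, matching the reported result
```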
Ripping out a sophisticated, undoubtedly expensive, system may not be the solution to your performance problems. In your search for a fix, remember that performance is the bottom line. Minimized elapsed time, combined with scalability and ease of use, adds up to increased performance. Also, in evaluating a solution's performance, be sure you can do so in your own environment. This is the only way to determine the best solution for you and for your current system.
If you find the right tool to complement your current system, the light at the end of the tunnel may end up even brighter than you first expected.