LESSON - Take Your Extract Transform and Load Processes to the Next Level Utilizing p-Extract Transform and Load
By Craig Abramson, Senior Technical Analyst, Syncsort Incorporated
There are numerous ETL (extract-transform-load) products in the marketplace to choose from to help with necessary data processing. But because of the wide range of choices, selecting the right one can be difficult. Many of the products available may have all the bells and whistles, but with the enormous volumes of data that a company may accumulate on a daily basis, this just isn’t enough. Companies must consider performance when processing data, because slow-running applications waste not only time but money as well. That’s why a performance ETL (p-ETL) product is critical.
There are numerous features a p-ETL product should have in order to complete a specific application, as well as be useful in future applications. Features that tend to be important for most projects include:
- Performance—Processing time is critical, so it’s important to choose a product that not only has all of the key features needed, but also can complete the data processing as fast as possible.
- Scalability—Benchmark tests as well as independent reviews should indicate that a product can easily handle the gigabytes of data generated from largetransaction processing applications.
- Support—Since the platform and data formats may change over time, a product should be able to support a variety of data and file formats.
- Ease of use—Some products may require complex commands,while others will perform tasks with the click of a mouse. This is especially important when there are a large number of users. A product that’s easier to use requires less time to be spent on training or assistance.
- Price—Products can range in price from a thousand dollars to over a million. Of course, the higher-end products are not built specifically as ETL products. Also, some products include technical support in the price, while others charge additional fees.
- Recommendations—Ask other people in the industry for recommendations on products that they’ve used, then request free trials of those products.
Once the right p-ETL tool is chosen, it will be used to convert, cleanse, format, and aggregate the data, then pass the data from the warehouse to the appropriate data marts. The data is transformed into a specific format that will help speed querying and reporting time. It can then be accessed and analyzed quickly by the decision makers in the various departments of a company. For example, a marketing analyst can run a report using the sales data mart to determine if there was an increase in product sales soon after the launch of a new marketing campaign. And the easier it is for the decision makers to gather this type of information, the faster they’ll be able to turn it into a business advantage.
Companies must consider performancewhen processing data, because slowrunningapplications waste not only timebut money as well.
Beginning the p-ETL Process
In order to take advantage of a p-ETL product, you should begin by evaluating your company’s data requirements. This includes deciding what data analysis applications are needed, as well as examining the type of data that is available. You’ll also have to identify where each kind of data comes from, how often it is updated, how it is currently being used, and where it can be stored within the company. Then you’ll have to decide how you are to going to use it and what must be done to clean the data and transform it. Once this is completed, the next step is to pre-process the data prior to loading it into the data warehouse and database.
Pre-processing allows you to utilize the p-ETL product to reformat the data for seamless integration, and also results in faster and more efficient database and data warehouse loads. Here is a general rule to keep in mind whenever you have a sizable amount of data to load:
The format and sequence of load data should be as close as possible to its format and sequence in the database.
The reason for this rule is simple. Although some database engines load quite efficiently, databases are optimized for query processing and related tasks, not loading. By pre-processing your data before loading, you leave your database free to do the work it’s been designed for, and you gain several performance advantages:
- Your database will load faster
- Indexing will be more efficient
- Data will be ready to use sooner
- Your database will be free to do other work
Transforming data makes it easier and more efficient to access just the information a company needs directly, minimizing the elapsed time of the applications. There are three different levels at which the data transformation processes can take place:
- Source-level operations—At this level, database tables can be converted to flat files and vice versa. This allows all of the data to be converted into the same format to speed processing.
- Record-level operations—This encompasses such processes as joins, sorts, merges, or just copying records to the appropriate target(s). Before outputting the records, they can be filtered or reformatted for faster access. Aggregations can also be performed at this level.
- Field-level operations—This includes data type and format conversions, arithmetic operations, string operations, date-time operations, pattern matching, and conditional operations. These processes help remove extraneous information, allowing only the necessary data to move through the data warehouse.
These kinds of transformations allow data to be cleansed and business rules for data quality to be created. The streamlined data can then be loaded into a data warehouse or other dataintensive application.
Improve Query Response Time
When it comes to data warehouse query performance, speed is everything. Faster performance means needed data and analysis is available to act upon—quickly. In order to achieve this goal, data warehouse experts agree that data aggregates can be a powerful tool. According to Ralph Kimball, author of The Data Warehouse Lifecycle Toolkit, data warehouse administrators can “expect anywhere from a tenfold to a thousandfold improvement in runtime performance by having the right aggregates available.”
Aggregates are pre-stored summary records held in a data warehouse, and are usually built from the base layer upward, culminating in high-level totals. For example, a data warehouse administrator may decide to build aggregates for sales based on product, store, and timerelated totals such as month and year. Tables of aggregates eliminate the time and resources required to summarize data each time it’s needed to respond to a query, allowing the query to be answered very efficiently in just a few seconds. A p-ETL product will provide you with the ability to create these aggregates quickly and dramatically improve response time.
When it comes to data warehouse queryperformance, speed is everything. Fasterperformance means needed data andanalysis is available to act upon—quickly.
An ETL product just isn’t enough if it can’t complete the necessary processing within the shortest amount of time. Utilizing a performance ETL product can mean the difference between a company that realizes its full potential and one that just keeps missing the mark. After all, p-ETL offers faster availability of data, improved manageability, better decisionmaking capabilities, and much more.
This article originally appeared in the issue of .