LESSON - Third-Generation Extract Transform and Load: Delivering the Best Performance
By Yves de Montcheuil, Director of Product Marketing, Sunopsis
As computer systems started to evolve from monolithic mainframes to distributed computing systems, and as business intelligence made its debut, the first ETL (extract, transform, load) solutions were introduced. Since that time, several generations of ETL have been produced.
First Generation: The Origin of ETL and the Legacy Code Generators
Original data integration tools generated native code for the operating system of the platform on which the data integration processes were to run. Most of these products actually generated COBOL, since at that time data was largely stored on mainframes. These products made the data integration processes easier than they had been by taking advantage of a centralized tool to generate data integration processes and by propagating the code to the appropriate platforms—instead of manually writing programs to do so. Performance was very good because of the inherent performance of native compiled code, but these tools required an in-depth knowledge of programming on the different platforms. Maintenance was also difficult because the code was disseminated to different platforms and differed with the type of sources and target.
Second Generation: The Proprietary ETL Engines
Next came the second generation of ETL tools, which are based on a proprietary engine that runs all the transformation processes. This approach solved the problem of having to use different languages on different platforms, and required expertise in only one programming language: the language of the ETL tool itself. However, a new problem arose: the proprietary engine performing all the transformations became a bottleneck in the transformation process. All data, coming from various sources to go to the target, had to pass through an engine that processed data transformations row by row—a very slow approach when dealing with significant volumes of data.
Third Generation: The E-L-T (Extract, Load, Transform) Architecture
Addressing the challenges faced by tools from the previous two generations while leveraging their respective strengths, a new generation of ETL tools recently appeared. Since the inception of the previous generation—proprietary engines—database vendors have invested considerable resources to greatly improve the capabilities of their SQL languages. By leveraging these improvements, they have made it possible for an ETL tool to generate and execute highly optimized data integration processes, driven by the native SQL (or other languages) of the databases involved in these processes.
This third generation—E-L-T architecture—provides a highly graphical environment, along with the ability to generate native SQL to execute data transformations on the data warehouse server. This new approach has several clearly identifiable advantages:
- It eliminates the ETL hub server sitting between sources and target, which was introduced by the second generation of ETL products.
- Using an RDBMS to execute data transformations allows bulk processing of the data. Bulk is up to 1,000 times faster than row-by-row data processing. The larger the volume of data to manage, the more important the bulk processing becomes.
- It also provides for better performance than any other solution, because transformations are executed by the RDBMS engine in bulk instead of row by row with second-generation ETL. In addition, the large database vendors—Oracle, IBM, Microsoft, Sybase, and so on—have had significantly more resources and time to invest into improving the performance of their engines than have the vendors of second-generation ETL software. Relying on the RDBMS engines provides a way to leverage these investments.
- In the E-L-T architecture, all database engines can potentially participate in a transformation—thus running each part of the process where it is the most optimized. Any RDBMS can be an engine, and it may make sense to distribute the SQL code among sources and target to achieve the best performance. For example, a join between two large tables may be done on the source.
- A centralized design tool makes programming and maintenance easy, allowing developers to control which database engine processes which piece of information in which way.
Today’s RDBMSs have the power to perform any data integration work. Third-generation E-L-T tools take advantage of this power by leveraging and orchestrating the work of these systems—and processing all data transformations in bulk.