Can Hadoop Replace My ETL Tool?
The answer is: It depends. ETL and ELT jobs vary a lot. Hadoop is suitable for some but not others.
- By Philip Russom, Ph.D.
- February 24, 2015
The last article I wrote for TDWI's BI This Week (Can Hadoop Replace a Data Warehouse?) got a blizzard of clicks from readers because it touched the "Hadoop nerve." That article helped many people sort out which data and workloads belong on which data platforms in modern multi-platform data architectures. At the risk of touching another Hadoop nerve, this article sorts out which types of ETL and ELT jobs are suited to Hadoop versus those that are best served elsewhere.
We say "ETL" as if it's one monolithic thing. Actually, ETL jobs vary considerably and involve numerous enabling technologies. In the context of Hadoop, two broad categories of ETL workloads are relevant: those that require substantial relational technologies and those that don't.
At one extreme, many ETL jobs join three or more tables, execute complex SQL routines that run to hundreds of lines, create temporary tables, or involve multi-pass SQL. These relational ETL jobs are often developed and executed with a mature ETL tool, and the tool may push relational processing into a relational database management system (DBMS). This is usually called "ETL push down" or "ELT." In these cases, the T (i.e., data transformation) occurs in a relational database or similar data platform instead of on the ETL tool hub.
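To make the push-down pattern concrete, here is a minimal sketch in Python. The transformation is expressed as SQL and executed inside the database engine rather than on the ETL hub. SQLite is used purely as a self-contained stand-in for a relational DBMS, and the table and column names are hypothetical.

```python
import sqlite3

# Stand-in for the target warehouse or operational DBMS.
# Tables, columns, and sample rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, region TEXT);
CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER,
                        order_date TEXT, amount REAL);

INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER');
INSERT INTO orders VALUES (10, 1, '2015-02-01', 120.0),
                          (11, 2, '2015-02-01',  80.0),
                          (12, 1, '2015-02-02',  40.0);
""")

# The "T" of ELT: the join and aggregation are pushed down as SQL and run
# inside the database engine, not on the ETL tool's hub.
conn.executescript("""
CREATE TABLE daily_revenue AS
SELECT o.order_date,
       c.region,
       SUM(o.amount) AS total_amount
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
GROUP BY o.order_date, c.region;
""")

for row in conn.execute("SELECT * FROM daily_revenue ORDER BY order_date, region"):
    print(row)
```

A real push-down job would typically join more tables and stage intermediate results, but the control flow is the same: the ETL tool (or a script standing in for it) issues SQL, and the DBMS does the heavy lifting.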
For heavily relational jobs, Hadoop is an unlikely candidate because ANSI standard SQL and other complex relational technologies are not fully supported on Hadoop today. Even so, Hadoop is improving rapidly, and third-party tools are emerging to provide a relational front-end for Hadoop, so it's probable that Hadoop's relational capabilities will soon be more compelling for heavily relational and SQL-based processing.
At the other extreme, some ETL jobs need only basic relational capabilities (as seen in an HBase row store or a Hive table) or no relational capabilities at all (as is typical of the algorithmic approach of most hand-coded MapReduce jobs). For example, some early adopters of Hadoop have migrated operational data stores to Hadoop; these stores manage customer masters, archives of transactions, or industry-specific data (such as call detail records in telco or supply chain documents in retail and manufacturing).
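To illustrate what "basic relational capabilities" look like in practice, here is a minimal sketch using the third-party happybase client against a hypothetical customer_master table in HBase. The host name, table, and column families are assumptions, and the table is assumed to already exist; access is by row key rather than by joins or multi-pass SQL.

```python
import happybase  # third-party HBase client; assumes an HBase Thrift server is reachable

# Hypothetical customer-master table migrated to an HBase row store.
connection = happybase.Connection("hbase-thrift-host")  # placeholder host
table = connection.table("customer_master")

# Write one customer record keyed by customer ID, spread across column families.
table.put(b"cust-00042", {
    b"profile:name":   b"Acme Corp",
    b"profile:region": b"EMEA",
    b"contact:email":  b"ops@acme.example",
})

# Read it back by key -- a single-row lookup, no SQL involved.
print(table.row(b"cust-00042"))
```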
ETL jobs that perform simple aggregations, summations, and calculated values (but at massive scale, against millions of records) are well-suited to the Hadoop environment, and these jobs can be developed for a fraction of the cost of a high-end ETL tool -- if you have the appropriate in-house programming skills. Let's not forget that Hadoop originated in Internet firms, where it did simple but massive summations of clicks, page views, and e-commerce transactions. For workloads resembling those, Hadoop continues to be a compelling and cost-effective platform.
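As a concrete example of this kind of job, here is a minimal Hadoop Streaming sketch in Python that sums page views per URL. The log layout, HDFS paths, and script name are assumptions for illustration; a real job would adapt them to the source data.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming job: sum page views per URL.

Example invocation (paths and jar location are assumptions):
  hadoop jar hadoop-streaming.jar \
      -input /logs/pageviews -output /out/pageview_counts \
      -mapper "pageviews.py map" -reducer "pageviews.py reduce" \
      -file pageviews.py
"""
import sys

def mapper():
    # Each input line is assumed to be a tab-separated log record whose
    # third field is the URL; emit "url <TAB> 1" for every page view.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            print("%s\t1" % fields[2])

def reducer():
    # Hadoop delivers mapper output sorted by key, so identical URLs arrive
    # together; keep a running total and emit it when the key changes.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url == current_url:
            count += int(value)
        else:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if role == "map" else reducer()
```

The same script can be tested locally without a cluster (cat a sample log through the mapper, sort, then the reducer), which is part of why this style of simple, massive summation took hold at the Internet firms where Hadoop originated.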
Before closing, let's return to the central question I raised: Can Hadoop replace my ETL tool? Although this is a common question at the moment, it's rather off-base. TDWI surveys show that Hadoop is most often an extension of existing systems and practices and rarely a replacement for anything. A better question might be: How can Hadoop extend my ETL solutions?
Here are a couple of answers:
- Mature data integration teams have a diverse portfolio of tools for data integration and related disciplines, such as data quality, metadata, master data, and replication. Most data integration architectures include mid-tier databases and file systems for staging, processing, and distributing data in transit. Adding Hadoop into the portfolio of tools and data platforms seems to me to be an extension of these trends -- without replacing anything.
- Vendor-supplied data integration tools have long supported processing that's pushed down into source systems (e.g., an operational DBMS) or target systems (e.g., a data warehouse). Today, leading vendor tools already provide interfaces into the usual Hadoop connection points (HDFS, HBase, HiveQL, Hive tables, MapReduce, Impala, Drill, etc.). Even if you want to process certain datasets on Hadoop (as described earlier), a hub-based data integration tool can still provide control and architectural integrity in the context of a multi-platform data flow, as the sketch following this list illustrates.
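As a rough sketch of that hub-and-spoke pattern, the coordinating process below stages an extracted file into HDFS and then pushes the transformation to Hadoop as HiveQL, so the heavy processing runs on the cluster while the overall flow stays under one point of control. The paths and table names are hypothetical, and it assumes the hadoop and hive command-line clients are installed and configured; a vendor tool would accomplish the equivalent through its own Hadoop connectors rather than the CLI.

```python
import subprocess

# Hypothetical file and table names, for illustration only.
LOCAL_EXTRACT = "/tmp/orders_extract.csv"
HDFS_STAGING  = "/staging/orders/orders_extract.csv"

# Step 1: land the extracted file in HDFS (the extract and load).
subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", "/staging/orders"])
subprocess.check_call(["hadoop", "fs", "-put", "-f", LOCAL_EXTRACT, HDFS_STAGING])

# Step 2: push the transformation into Hadoop as HiveQL, so the heavy
# lifting runs on the cluster while this script only coordinates the flow.
HQL = """
INSERT OVERWRITE TABLE orders_summary
SELECT order_date, region, SUM(amount)
FROM   staging_orders
GROUP BY order_date, region;
"""
subprocess.check_call(["hive", "-e", HQL])
```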
For further reading, you can freely download several TDWI Best Practices Reports on Hadoop and big data topics at www.tdwi.org/bpreports. Recommended reports include: Integrating Hadoop into Business Intelligence and Data Warehousing (2013), Managing Big Data (2013), Evolving Data Warehouse Architectures (2014), and Hadoop for the Enterprise (coming April 1, 2015).