TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Think
- Research & Resources
  - TDWI Playbook | Next Generation Data Science: The AI-Driven Data Science Life Cycle
  - TDWI Data Points | The Data Foundation for AI
  - TDWI Best Practices Report | Data Strategies and Foundations for Modern Data Management
  - TDWI Insight Accelerator | Adopting a Platform Approach for Gaining Insights from Unstructured Data
- Webinars
  - Data Integration for AI: Overcoming Modern Pipeline Challenges July 23, 2025
  - From Silos to Insights: Centralizing Data to Drive AI July 24, 2025
  - Expert Panel: Leveraging AI-Powered Solutions for Data Management July 28, 2025
  - A Generative AI Framework for Credit and Financial Markets July 29, 2025
- Virtual Summits
  - Virtual Events Keys to Making Your Data AI Ready September 10, 2025
  - Virtual Events Data Quality for BI, Analytics and AI October 22, 2025
  - Virtual Events Modern Data Strategy November 12, 2025
  - Virtual Events What’s Ahead in 2026 for Data & Analytics December 10, 2025
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Speaking of Data Podcast
  
  Current Research Surveys
Train
- In-Person Events
  - Conference TDWI Transform 2025 San Diego August 18, 2025
  - Executive Summit TDWI Modern Data Leader's Summit San Diego: AI in the Enterprise August 18, 2025
  - Executive Summit AI Accelerate 2025, Brought to You by AI Boadroom & TDWI August 18, 2025
  - Conference TDWI Transform 2025 Orlando November 16, 2025
- Virtual Live Seminars
  - TDWI Data Governance Principles and Practices: Managing Data as an Asset June 25, 2025
  - Building Your Company’s Data Governance Roadmap June 25, 2025
  - Data Governance: Driving Engagement and Organizational Change June 26, 2025
  - A Framework for Modern Data Governance June 25, 2025
- Online Learning
- By Topic
  - By Topic
    
    Explore the Latest AI, Analytics, and Data Research and Training by Topic
  - BI, Analytics, and Data Literacy
  - AI, Data Science, and Machine Learning
  - Data Management and Governance
  - Platforms and Architecture
  - Strategy and Methods
- Train Your TeamCustom solutions for training your team
  
  Get CertifiedEarn a professional credential in BI and Analytics, Data Governance, or AI
  
  TDWI MembershipExclusive access to the research, tools, training, and connections
Engage
- Connect
  - Connect and Contribute to Our Vibrant Community of Data Leaders
    
    Subscribe to TDWI Stay up to date on the latest news and events. Sign Up
    
    Become a TDWI Member Gain exclusive access to the research, tools, training, and connections to move your careers, teams, and projects forward. Learn More
    
    Become a Part of the TDWI Research Panel Make a difference in the data and analytics industry and earn incentives by sharing your insights with TDWI. Explore Now
    
    Speak at TDWI Events Share your expertise and build your personal brand as a speaker at a TDWI In-Person or Virtual Event. Submit a Proposal
    
    Become a TDWI Research Fellow Apply to be a member of TDWI’s industry leading research team. Apply Today
    
    Become a Member of the Data & AI Leaders Forum Engage in collaborative discussions, stay ahead of the curve, and stay in the know. Apply Now
    
    Showcase Your Data & AI Solutions Reach and engage with TDWI community through multi-channel marketing programs. Learn More

RESEARCH & RESOURCES

ETL on Hadoop Is Here. Now. Today.

To the extent that Hadoop is tapped as an all-purpose landing zone and staging area for enterprise data, it likewise becomes -- by default -- a logical platform for data preparation and transformation.

By Stephen Swoyer
December 17, 2013

There's a reason Hadoop is talked about as a platform for ETL: increasingly, that's where your data is. To the extent that Hadoop is tapped as an all-purpose landing zone and staging area for enterprise data, it likewise becomes -- by default -- a logical platform for data preparation and transformation.

That Hadoop is itself a general-purpose massively parallel processing (MPP) platform makes it even more appealing -- in this case, as a low-cost platform for high-performance, parallel ETL processing.

According to survey data from TDWI Research, more than half of all organizations are managing big data today, primarily in the form of "mostly structured" data (e.g., in database systems of one terabyte or more). Almost one-third (31 percent) say they're managing big data as "multi-structured" information, a category that consists of human-readable, semi-structured, and "unstructured," data types.

What's more, nearly one-quarter (23 percent) of respondents believe their analytic practices could probably benefit from the "improved data staging for data warehousing" that's afforded by a platform such as Hadoop, which can be used to consolidate a diversity of data sources.

That's not all. In a recent Checklist Report, Philip Russom, research director for data management with TDWI Research, assessed how and where Hadoop fits as a platform for data management (DM). ETL, he noted, is an excellent – and, to a degree, an established -- Hadoop use case.

"A considerable amount of data is processed in a [data warehouse's] staging area to prepare source data for specific uses -- reporting, analytics -- and for loading into specific databases [such as data warehouses or data marts]. Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load. Hadoop allows organizations to deploy an extremely scalable and economical ETL environment," Russom wrote.

Another proposed use case – data archiving -- also helps to boost Hadoop's ETL bona fides. "Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape, or delete it. Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment," Russom explained.

So, too, does Hadoop's NoSQL trump card: its flexible schema. In this regard, Hadoop offers a relaxed alternative to the rigid schema of the data warehouse, which wants to type and model data prior to ingesting it. True, some vendors now promote the use of techniques such as late-binding to help ameliorate this issue, but -- as with, for example, late-binding -- these techniques (or the in-database implementations that must at any rate support them) are still comparatively immature.

However, the biggest reason an organization might opt to use Hadoop over a traditional ETL platform is also the most fundamental one: its cost. Even in a perfect-world scenario that assumes (1) an MPP engine for ETL processing or ELT push-down and (2) enhanced schema flexibility, e.g., via mature support for late-binding, a data warehouse platform will still cost significantly more on a per-TB basis than will Hadoop.

ETL-on-Hadoop Is Already Here

This is one of the reasons all ETL vendors market versions of their software that run on or which leverage Hadoop. Some ETL-on-Hadoop products leverage Hadoop's MapReduce compute engine to perform ETL operations -- or, more precisely, transformations and manipulations -- while others substitute their own, ETL-optimized libraries for MapReduce.

The latter approach is attractive because even though Hadoop MapReduce can be used to parallelize ETL jobs, it can't necessarily do so efficiently. "Hadoop [MapReduce] is brute force parallelism. If you can easily segregate data to each node and not have to re-sync it for another operation [by, for example,] broadcasting all the data again -- then it's fast," Mark Madsen, research analyst with IT strategy consultancy Third Nature Inc., told BI This Week last year.

On the other hand, given: (1) that data is already residing (dispersed as file blocks) across the Hadoop Distributed File System (HDFS); and (2) sufficient parallelization -- such that ETL workloads can be distributed across a half-dozen, a dozen, or multiple dozens of compute nodes -- this isn't necessarily a problem.

With the arrival of Apache YARN -- i.e., "yet another resource negotiator," a facility designed to make it easier to schedule, monitor, and manage non-MapReduce workloads (such as third-party ETL libraries) in the context of Hadoop -- it's poised to be less of a problem still.

This is particularly true of vendors that run their own ETL libraries in Hadoop.

One thing almost all ETL-on-Hadoop offerings have in common is abstraction. Even though Hadoop MapReduce has the potential to be a powerful parallel processing platform for ETL workloads, this potential is mitigated in practice by the complexity of the Hadoop MapReduce toolset, which -- at least when compared to the toolsets of most enterprise ETL products -- is relatively immature.

For example, MapReduce jobs must be coded in Java, Pig Latin -- the interpretive language used by Pig, a platform designed to abstract the complexity of the MapReduce engine -- Perl, or Python. It's also possible to use open source libraries to code MapReduce jobs in C, C++, Ruby, or other languages. An increasingly popular alternative is to do this by means of Hive, the SQL-like interpreter for Hadoop that compiles MapReduce jobs; in this case, one would code an ETL job in Hive Query Language (HQL) much like one might code it in SQL. To the extent that HQL isn't SQL and that it lacks support for all the features or capabilities of even the ANSI SQL-92 standard; that SQL generator tools offer less-than-perfect support for HQL, if at all; and that most ETL-on-Hadoop tools generate Java (as distinct to HQL), this last approach primarily entails hand-coding.

An important consideration, however, is the fact that Hadoop MapReduce wasn't conceived with ETL in mind: it was designed primarily in the context of Web application development. Coding MapReduce jobs for ETL workloads likewise requires familiarity with (and, ideally, expertise in) data management practices and methods. For this reason, most ETL-on-Hadoop tools expose a drag-and-drop GUI design studio that (in the background) generates MapReduce jobs in the form of Java code.

Some vendors have been shipping ETL-on-Hadoop products for a few years now; others have marketed such products for a year or more. Still others introduced an ETL-on-Hadoop automation facility fairly recently.

With respect to building or designing an ETL job, the user experience (UX) in an ETL-on-Hadoop tool is – even at this comparatively early stage of development – functionally similar to the UX in a conventional ETL tool: e.g., a drag-and-drop, point-and-click, GUI-centric design, with (optionally) facilities to hand-code or tweak SQL, and embed procedural code as user-defined functions (UDF). The tool itself automates the generation of Hadoop-ready code/packages.

"Some people get very excited [about ETL-on-Hadoop], but then they realize that you need very, very skilled IT developers in order to build anything meaningful with Hadoop. That's one of the obstacles," said Jorge Lopez, director of product marketing with Syncsort, in an interview last year.

"They are also realizing that although Hadoop can do lots of things, and you can achieve really good performance, that requires a lot of investment in terms of development and tuning, but also in terms of hardware. People think, hardware is very cheap, but in order to get meaningful performance, you have to throw a lot of boxes, even if those boxes are $4,000 or $5,000, when you've got 20, 50, [or] 100 boxes, it's a lot of cost overhead. Cooling, power, even real estate to house all of these servers, licensing costs, maintenance costs, etc. When we talk about making Hadoop faster and more efficient, that can really resonate with people who are doing interesting things."

In a recent presentation at Teradata Corp.'s Partners user conference, industry luminary Richard Winter outlined a new metric -- viz., "total cost of data," or TCOD -- which can be used to assess the cost of a traditional data warehouse as distinct to a big data platform for different kinds of workloads. One of Winter's examples involved comparisons of SQL queries and MapReduce jobs coded in Java; the MapReduce code required approximately five times as many lines.

Traditionally, these jobs would have to be hand-coded in Java (for native MapReduce ETL) or HQL (for Hive-powered MapReduce ETL) and/or coded/generated in Pig Latin for Pig. One strong selling point of ETL-on-Hadoop offerings, then, is that they're able to automatically generate this code. (In some cases, ETL-on-Hadoop products eschew the use of the MapReduce compute engine altogether. These products nonetheless generate Java code to be fed into Hadoop, where it's executed by their proprietary libraries.)

Another strong selling point is that the Java code thus produced is (ostensibly, at any rate) optimized for ETL: in other words, that the ETL tool itself incorporates a vendor's expertise in data management practices and methods. According to Lopez, these are some of the considerations cited by Syncsort's own customers: in the case of one beta account, he indicated, the organization was able to replace hundreds of lines of hand-scripted Java code with jobs automatically generated by DMX-h. Unlike hand-coded Java, Lopez argued, DMX-h jobs are reusable, manageable, and auditable.

"A lot of the things you do in Hadoop require a lot of manual coding and also a lot of tuning. It's something that a Java programmer could do, but you still have to learn a lot [about] how Hadoop works, and [expertise of this kind is] in very high demand because there's not a lot of people who know how to do it."

TDWI Membership

Get immediate access to training discounts, video library, research, and more.

Find the right level of Membership for you.

Learn More

↑

Research & Resources

Webinars

Virtual Summits

By Topic

In-Person Events

Virtual Live Seminars

Online Learning

By Topic

Connect and Contribute to Our Vibrant Community of Data Leaders

RESEARCH & RESOURCES

ETL on Hadoop Is Here. Now. Today.

TDWI Membership

Get immediate access to training discounts, video library, research, and more.

TDWI

Engage

Research