ETL on Hadoop Is Here. Now. Today.

To the extent that Hadoop is tapped as an all-purpose landing zone and staging area for enterprise data, it likewise becomes -- by default -- a logical platform for data preparation and transformation.

There's a reason Hadoop is talked about as a platform for ETL: increasingly, that's where your data is. To the extent that Hadoop is tapped as an all-purpose landing zone and staging area for enterprise data, it likewise becomes -- by default -- a logical platform for data preparation and transformation.

That Hadoop is itself a general-purpose massively parallel processing (MPP) platform makes it even more appealing -- in this case, as a low-cost platform for high-performance, parallel ETL processing.

According to survey data from TDWI Research, more than half of all organizations are managing big data today, primarily in the form of "mostly structured" data (e.g., in database systems of one terabyte or more). Almost one-third (31 percent) say they're managing big data as "multi-structured" information, a category that consists of human-readable, semi-structured, and "unstructured," data types.

What's more, nearly one-quarter (23 percent) of respondents believe their analytic practices could probably benefit from the "improved data staging for data warehousing" that's afforded by a platform such as Hadoop, which can be used to consolidate a diversity of data sources.

That's not all. In a recent Checklist Report, Philip Russom, research director for data management with TDWI Research, assessed how and where Hadoop fits as a platform for data management (DM). ETL, he noted, is an excellent – and, to a degree, an established -- Hadoop use case.

"A considerable amount of data is processed in a [data warehouse's] staging area to prepare source data for specific uses -- reporting, analytics -- and for loading into specific databases [such as data warehouses or data marts]. Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load. Hadoop allows organizations to deploy an extremely scalable and economical ETL environment," Russom wrote.

Another proposed use case – data archiving -- also helps to boost Hadoop's ETL bona fides. "Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape, or delete it. Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment," Russom explained.

So, too, does Hadoop's NoSQL trump card: its flexible schema. In this regard, Hadoop offers a relaxed alternative to the rigid schema of the data warehouse, which wants to type and model data prior to ingesting it. True, some vendors now promote the use of techniques such as late-binding to help ameliorate this issue, but -- as with, for example, late-binding -- these techniques (or the in-database implementations that must at any rate support them) are still comparatively immature.

However, the biggest reason an organization might opt to use Hadoop over a traditional ETL platform is also the most fundamental one: its cost. Even in a perfect-world scenario that assumes (1) an MPP engine for ETL processing or ELT push-down and (2) enhanced schema flexibility, e.g., via mature support for late-binding, a data warehouse platform will still cost significantly more on a per-TB basis than will Hadoop.

ETL-on-Hadoop Is Already Here

This is one of the reasons all ETL vendors market versions of their software that run on or which leverage Hadoop. Some ETL-on-Hadoop products leverage Hadoop's MapReduce compute engine to perform ETL operations -- or, more precisely, transformations and manipulations -- while others substitute their own, ETL-optimized libraries for MapReduce.

The latter approach is attractive because even though Hadoop MapReduce can be used to parallelize ETL jobs, it can't necessarily do so efficiently. "Hadoop [MapReduce] is brute force parallelism. If you can easily segregate data to each node and not have to re-sync it for another operation [by, for example,] broadcasting all the data again -- then it's fast," Mark Madsen, research analyst with IT strategy consultancy Third Nature Inc., told BI This Week last year.

On the other hand, given: (1) that data is already residing (dispersed as file blocks) across the Hadoop Distributed File System (HDFS); and (2) sufficient parallelization -- such that ETL workloads can be distributed across a half-dozen, a dozen, or multiple dozens of compute nodes -- this isn't necessarily a problem.

With the arrival of Apache YARN -- i.e., "yet another resource negotiator," a facility designed to make it easier to schedule, monitor, and manage non-MapReduce workloads (such as third-party ETL libraries) in the context of Hadoop -- it's poised to be less of a problem still.

This is particularly true of vendors that run their own ETL libraries in Hadoop.

One thing almost all ETL-on-Hadoop offerings have in common is abstraction. Even though Hadoop MapReduce has the potential to be a powerful parallel processing platform for ETL workloads, this potential is mitigated in practice by the complexity of the Hadoop MapReduce toolset, which -- at least when compared to the toolsets of most enterprise ETL products -- is relatively immature.

For example, MapReduce jobs must be coded in Java, Pig Latin -- the interpretive language used by Pig, a platform designed to abstract the complexity of the MapReduce engine -- Perl, or Python. It's also possible to use open source libraries to code MapReduce jobs in C, C++, Ruby, or other languages. An increasingly popular alternative is to do this by means of Hive, the SQL-like interpreter for Hadoop that compiles MapReduce jobs; in this case, one would code an ETL job in Hive Query Language (HQL) much like one might code it in SQL. To the extent that HQL isn't SQL and that it lacks support for all the features or capabilities of even the ANSI SQL-92 standard; that SQL generator tools offer less-than-perfect support for HQL, if at all; and that most ETL-on-Hadoop tools generate Java (as distinct to HQL), this last approach primarily entails hand-coding.

An important consideration, however, is the fact that Hadoop MapReduce wasn't conceived with ETL in mind: it was designed primarily in the context of Web application development. Coding MapReduce jobs for ETL workloads likewise requires familiarity with (and, ideally, expertise in) data management practices and methods. For this reason, most ETL-on-Hadoop tools expose a drag-and-drop GUI design studio that (in the background) generates MapReduce jobs in the form of Java code.

Some vendors have been shipping ETL-on-Hadoop products for a few years now; others have marketed such products for a year or more. Still others introduced an ETL-on-Hadoop automation facility fairly recently.

With respect to building or designing an ETL job, the user experience (UX) in an ETL-on-Hadoop tool is – even at this comparatively early stage of development – functionally similar to the UX in a conventional ETL tool: e.g., a drag-and-drop, point-and-click, GUI-centric design, with (optionally) facilities to hand-code or tweak SQL, and embed procedural code as user-defined functions (UDF). The tool itself automates the generation of Hadoop-ready code/packages.

"Some people get very excited [about ETL-on-Hadoop], but then they realize that you need very, very skilled IT developers in order to build anything meaningful with Hadoop. That's one of the obstacles," said Jorge Lopez, director of product marketing with Syncsort, in an interview last year.

"They are also realizing that although Hadoop can do lots of things, and you can achieve really good performance, that requires a lot of investment in terms of development and tuning, but also in terms of hardware. People think, hardware is very cheap, but in order to get meaningful performance, you have to throw a lot of boxes, even if those boxes are $4,000 or $5,000, when you've got 20, 50, [or] 100 boxes, it's a lot of cost overhead. Cooling, power, even real estate to house all of these servers, licensing costs, maintenance costs, etc. When we talk about making Hadoop faster and more efficient, that can really resonate with people who are doing interesting things."

In a recent presentation at Teradata Corp.'s Partners user conference, industry luminary Richard Winter outlined a new metric -- viz., "total cost of data," or TCOD -- which can be used to assess the cost of a traditional data warehouse as distinct to a big data platform for different kinds of workloads. One of Winter's examples involved comparisons of SQL queries and MapReduce jobs coded in Java; the MapReduce code required approximately five times as many lines.

Traditionally, these jobs would have to be hand-coded in Java (for native MapReduce ETL) or HQL (for Hive-powered MapReduce ETL) and/or coded/generated in Pig Latin for Pig. One strong selling point of ETL-on-Hadoop offerings, then, is that they're able to automatically generate this code. (In some cases, ETL-on-Hadoop products eschew the use of the MapReduce compute engine altogether. These products nonetheless generate Java code to be fed into Hadoop, where it's executed by their proprietary libraries.)

Another strong selling point is that the Java code thus produced is (ostensibly, at any rate) optimized for ETL: in other words, that the ETL tool itself incorporates a vendor's expertise in data management practices and methods. According to Lopez, these are some of the considerations cited by Syncsort's own customers: in the case of one beta account, he indicated, the organization was able to replace hundreds of lines of hand-scripted Java code with jobs automatically generated by DMX-h. Unlike hand-coded Java, Lopez argued, DMX-h jobs are reusable, manageable, and auditable.

"A lot of the things you do in Hadoop require a lot of manual coding and also a lot of tuning. It's something that a Java programmer could do, but you still have to learn a lot [about] how Hadoop works, and [expertise of this kind is] in very high demand because there's not a lot of people who know how to do it."

TDWI Membership

Get immediate access to training discounts, video library, BI Teams, Skills, Budget Report, and more

Individual, Student, & Team memberships available.