Q&A: Large-scale ETL Driving Hadoop Projects

The need for large-scale ETL drives many Hadoop implementations. However, despite its powerful utilities and massive scalability, Hadoop alone lacks enough functionality for enterprise ETL, argues Syncsort's Jorge A. Lopez.

In this interview, Lopez explains what Hadoop offers, why it has become a popular ETL tool, and what is needed to successfully embark on a Hadoop ETL initiative. Lopez, who is responsible for data integration product marketing at Syncsort, has over 14 years of experience in BI and data integration. Prior to Syncsort, he was a senior product manager at MicroStrategy, where he oversaw the technical direction of key business intelligence products and led performance and scalability initiatives.

BITW: In a nutshell, what does Hadoop bring to the table regarding the management of data, especially big data?

Jorge A. Lopez: Hadoop is quickly becoming the new operating system for managing big data. In a nutshell, Hadoop brings massive horizontal scalability along with system-level services that allow developers to create big data applications at a highly disruptive price point. Much like any game-changing technology, Hadoop has the potential to level the playing field again, wiping the slate clean and creating many new opportunities for organizations across all industries.

What are some of the challenges that companies face in using Hadoop to manage big data?

I'll highlight a few challenges that I believe are sometimes overlooked. The first one is related to the skills gap. Developing and maintaining MapReduce jobs requires a mix of skills that are in very high demand. For instance, finding the right developers, with the right combination of programming -- Java, Pig, Hive -- and data warehousing expertise can be a daunting and expensive task. It can become an even bigger problem as IT departments try to scale Hadoop adoption across the entire organization.

Another challenge is making sure Hadoop does not become another data silo within the enterprise. In other words, the value of Hadoop comes from its ability to leverage all your data at a relatively low cost; this includes traditional sources (such as OLTP, files, and both structured and unstructured data) as well as legacy and mainframe data.

Finally, security is very important. Today, data is one of the most valuable and critical assets any organization possesses. Therefore, any viable Hadoop implementation must comply with the security requirements appropriate for your given industry or sector.

Can Hadoop be used as an ETL solution? What are its limits in that regard?

Absolutely! In fact, ETL is emerging as the primary use case for many Hadoop implementations. However, organizations embarking on Hadoop ETL initiatives must realize that Hadoop is not a complete ETL solution. Although Hadoop offers powerful utilities and massive scalability, it does not provide the complete set of functionality that users need for enterprise ETL. In most cases, those gaps in function must be filled using complex manual coding. Suddenly, we're back to the early 90s, in a world where ETL was done through manual coding, complex code generators, or both.

Think about a change data capture (CDC) job -- something that is widely done in ETL today. Implementing CDC in Hadoop, however, is very difficult. Data sets are typically much larger and distributed across data nodes in HDFS. Records need to be co-located to identify changes, and then a great deal of hand coding and tuning is needed to achieve acceptable performance.
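For illustration only -- not how any particular tool implements it -- the sketch below shows what hand-coded CDC looks like in the Hadoop Streaming style: each record is keyed by its primary key, the shuffle co-locates the old and new versions of that key on one reducer, and the reducer classifies the key as an insert, update, or delete. The record format and the local "shuffle" simulation are hypothetical.

```python
# Minimal, hypothetical CDC sketch in the Hadoop Streaming style.
# Input: two snapshots of a table, one line per record, "key|value".
# The shuffle/sort step is simulated locally with a dict keyed by record key.

from collections import defaultdict


def map_record(source, line):
    """Emit (key, (source, value)) so old and new versions co-locate."""
    key, value = line.split("|", 1)
    return key, (source, value)


def reduce_changes(key, tagged_values):
    """Compare co-located old/new values and classify the change."""
    versions = dict(tagged_values)          # {"old": ..., "new": ...}
    if "old" not in versions:
        return key, "INSERT", versions["new"]
    if "new" not in versions:
        return key, "DELETE", versions["old"]
    if versions["old"] != versions["new"]:
        return key, "UPDATE", versions["new"]
    return None                             # unchanged; emit nothing


if __name__ == "__main__":
    old_snapshot = ["1|alice", "2|bob", "3|carol"]
    new_snapshot = ["1|alice", "2|bobby", "4|dave"]

    # Simulated shuffle: group mapper output by key.
    groups = defaultdict(list)
    for source, lines in (("old", old_snapshot), ("new", new_snapshot)):
        for line in lines:
            key, tagged = map_record(source, line)
            groups[key].append(tagged)

    for key in sorted(groups):
        change = reduce_changes(key, groups[key])
        if change:
            print("\t".join(change))
```

Even this toy version hints at the problem Lopez describes: the logic itself is simple, but making it perform on billions of distributed records takes significant tuning that a packaged ETL tool would normally hide.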

What are some use cases for Hadoop currently, especially as an ETL solution?

One of the most popular ETL use cases is offloading heavy transformations, the "T" in ETL, from the data warehouse and into Hadoop. When you think about it, it makes complete sense. For years, organizations have struggled to scale traditional ETL architectures. Unable to keep up with the "three Vs" of big data, data integration platforms forced IT to push the transformations down to the data warehouse, creating a shift from ETL to ELT.

That's why today, data integration drives up to 80 percent of database capacity, resulting in unsustainable spending, ongoing tuning and maintenance effort, and poor user query performance. By shifting the "T" to Hadoop, organizations are finding they can dramatically reduce costs and free up database capacity for faster user queries.
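As a rough illustration of what "shifting the T" can mean in practice, the sketch below expresses a simple roll-up -- the kind of GROUP BY that an ELT architecture would push into the warehouse -- as Hadoop Streaming-style map and reduce functions. The sales schema and sample rows are invented, and a real job would run across the cluster rather than in this local simulation.

```python
# Hypothetical example of the "T" moved out of the warehouse: a daily
# revenue roll-up expressed as map/reduce functions instead of an
# in-database "INSERT ... SELECT ... GROUP BY" (ELT) step.

from itertools import groupby
from operator import itemgetter


def map_sale(line):
    """Input line: 'date,store_id,amount' -> key/value pair for the shuffle."""
    date, store, amount = line.strip().split(",")
    return f"{date},{store}", float(amount)


def reduce_revenue(key, amounts):
    """Sum all amounts the shuffle co-located under one (date, store) key."""
    return key, sum(amounts)


if __name__ == "__main__":
    raw_sales = [
        "2013-06-01,store-7,19.99",
        "2013-06-01,store-7,5.00",
        "2013-06-01,store-9,42.50",
    ]
    pairs = sorted(map(map_sale, raw_sales), key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        date_store, total = reduce_revenue(key, (amt for _, amt in group))
        print(f"{date_store}\t{total:.2f}")   # load only this small result set
```

The warehouse then loads only the aggregated result, not the raw detail, which is where the freed-up database capacity comes from.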

Another use case that is growing in popularity is the ability to use Hadoop to ingest and analyze mainframe data. Yes, mainframe. It may sound surprising at first, but if you think about it, mainframes still power many mission-critical applications throughout the enterprise. They collect, generate, and process some of the largest data volumes. In fact, mainframe data can be the critical reference point for new data sources such as Web logs and sensor data. Organizations simply can't afford to neglect this data, and they know it. That's why they are looking for ways to translate, ingest, and process their mainframe data in a way that is cost-effective. Unfortunately, until recently, analyzing mainframe data was a very expensive proposition. Thanks to Hadoop, that is no longer true.
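As one hedged example of the "translate" step, the sketch below decodes fixed-width EBCDIC records (code page 037) into delimited text that Hadoop tools can consume. The record layout and sample data are invented for illustration and are not tied to any specific product; real mainframe extracts often also involve packed-decimal fields and copybook metadata, which this sketch does not cover.

```python
# Hypothetical sketch of translating mainframe data: decode EBCDIC
# (code page 037) fixed-width records into pipe-delimited ASCII lines
# suitable for loading into HDFS. The record layout is invented.

RECORD_LAYOUT = [            # (field name, width in bytes)
    ("account_id", 8),
    ("customer", 12),
    ("amount", 9),
]


def decode_record(raw: bytes) -> str:
    """Turn one EBCDIC fixed-width record into a pipe-delimited line."""
    text = raw.decode("cp037")             # EBCDIC -> Unicode
    fields, offset = [], 0
    for _, width in RECORD_LAYOUT:
        fields.append(text[offset:offset + width].strip())
        offset += width
    return "|".join(fields)


if __name__ == "__main__":
    # One sample record, built here only so the sketch runs stand-alone.
    raw = ("00012345" + "JANE DOE".ljust(12) + "123.45".rjust(9)).encode("cp037")
    print(decode_record(raw))              # -> 00012345|JANE DOE|123.45
```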

What's a typical return on investment from using Hadoop for data management and ETL?

Hadoop will radically change the cost structure of managing data, and thus of ETL. That's why even the most traditional businesses -- Sears is an example -- are shifting to Hadoop. In fact, Sears has been so successful that it spawned a new business -- Metascale -- to provide big data services to organizations.

As to ROI, there are two ways to increase it. First, increase the benefits. With big data, this is achieved by uncovering valuable business insights that otherwise would be lost forever. That, in turn, can translate into higher revenues, increased competitiveness, and so on.

The second way to increase ROI is to reduce operational costs. Estimates from multiple sources indicate that the cost of managing data in Hadoop ranges from $1,000 to $2,000 per TB, compared to $20,000 to $200,000 per TB for the enterprise data warehouse. Even a very conservative estimate translates to savings of at least an order of magnitude, which gives you a rough picture of what it means to shift your workload to Hadoop. That's the kind of math that is pushing more organizations to look at Hadoop.
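To make that math concrete, here is a back-of-the-envelope calculation using the conservative ends of the per-TB figures quoted above; the 100 TB workload size is an assumption chosen purely for illustration.

```python
# Back-of-the-envelope version of the cost comparison quoted above.
# The per-TB figures come from the interview; the 100 TB workload is
# an illustrative assumption, not a benchmark.

workload_tb = 100
hadoop_cost_per_tb = 2_000        # conservative end of the $1,000-$2,000 range
warehouse_cost_per_tb = 20_000    # conservative end of the $20,000-$200,000 range

hadoop_total = workload_tb * hadoop_cost_per_tb
warehouse_total = workload_tb * warehouse_cost_per_tb

print(f"Hadoop:    ${hadoop_total:,}")                         # $200,000
print(f"Warehouse: ${warehouse_total:,}")                      # $2,000,000
print(f"Savings:   {warehouse_total / hadoop_total:.0f}x")     # 10x, an order of magnitude
```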

What does Syncsort bring to this discussion?

Syncsort is a leading big-data integration company, with thousands of deployments across all major platforms, including the mainframe. With our spring 2013 release, Syncsort introduced a new offering, DMX-h ETL Edition, to help organizations close the gaps between Hadoop and enterprise ETL, turning Hadoop into a faster, more robust, and feature-rich ETL solution. It provides a unique approach for organizations to maximize the benefits of MapReduce without compromising on the capabilities, ease of use, and typical use cases of conventional ETL tools. Anyone interested in looking at DMX-h ETL can take a free test drive at www.syncsort.com/try, without having to set up their own Hadoop cluster. It includes a Linux Virtual Machine with Cloudera CDH 4.2 and DMX-h ETL Edition pre-installed, along with use case accelerators and sample data.
