Talend "All-in" on Hadoop, MapReduce

Open source integration specialist Talend says it's all-in on Hadoop and MapReduce

By Stephen Swoyer
September 11, 2012

When it comes to Hadoop and MapReduce, open source data integration (DI) specialist Talend Inc. says it's all-in.

For starters, this means tapping the Hadoop framework to parallelize ETL and other data integration jobs. This in itself isn't controversial; competitors Syncsort Inc. and Pervasive Software Inc., among others, likewise claim to exploit Hadoop and its distributed file system (the Hadoop Distributed File System, or HDFS) to help parallelize ETL.

The difference -- and the controversy -- has to do with the way in which ETL or DI gets parallelized: Talend uses MapReduce itself to perform ETL or DI processing; Syncsort and Pervasive claim to use their own ETL or DI technologies.

"We're actually all-in as far as Hadoop is concerned. We announced back in March that [Hadoop leader] HortonWorks was embedding [Talend] Open Studio. We're working with Cloudera, too: we're supporting all of the evolutions of Hadoop, literally as they come online," says Yves de Montcheuil, vice president of marketing at Talend.

According to de Montcheuil, Talend proposes to parallelize all of its DI-related workloads -- from ETL to data quality to data cleansing -- across Hadoop and HDFS.

"We will run all of the integration, all of the cleansing jobs inside Hadoop by generating native code," he says. "This can be [a] MapReduce job, or it can be Pig script, HQL [i.e., Hive Query Language], [or] HBase SQL. We will be running against the Hadoop ecosystem component that makes the most sense for what you need to do, and we will do that without requiring you to understand the underlying language."

The idea of MapReduce-based ETL is by no means new. Three years ago, Dan Graham, general manager for enterprise systems with data warehousing powerhouse Teradata Corp., famously described MapReduce as "ETL on steroids." The idea, said Graham, was that MapReduce could accelerate -- in fact, supercharge -- certain kinds of ETL jobs.

That's true, says Mark Madsen, a principal with information management consultancy Third Nature Inc., but the idea of MapReduce-powered ETL is also something of a double-edged sword. In other words, says Madsen, although MapReduce can be used to accelerate certain kinds of ETL jobs, using it as a general-purpose ETL accelerator isn't necessarily a good idea.

"Hadoop [MapReduce] is brute force parallelism. If you can easily segregate data to each node and not have to re-sync it for another operation [by, for example,] broadcasting all the data again -- then it's fast," he explains. The problem, Madsen notes, is that this isn't always doable.
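Madsen's distinction can be illustrated with a small, single-process simulation. The sketch below is not Hadoop code; the node count, the record layout, and every function name are assumptions made up for illustration. It shows a record-local cleansing step that each "node" can run on its own slice, followed by a per-customer total that forces records to be redistributed (the "re-sync") before it can be computed.

```python
from collections import defaultdict

# Hypothetical input: (customer, amount) pairs with messy formatting.
records = [(" Acme", "10.50"), ("acme ", "4.50"), ("Globex", "7"),
           ("globex", "3.25"), ("ACME", "1")]

def partition(rows, n_nodes):
    """Split rows round-robin across n_nodes simulated nodes."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes

def clean(row):
    """Record-local step: needs nothing from any other node."""
    customer, amount = row
    return (customer.strip().lower(), float(amount))

def map_phase(nodes):
    """Each node cleans its own slice independently (the fast case)."""
    return [[clean(row) for row in node] for node in nodes]

def shuffle(nodes, n_nodes):
    """Move every record to the node that owns its key.

    Returns the regrouped nodes and the number of records that had to
    change nodes, which is the cost Madsen warns about.
    """
    regrouped = [[] for _ in range(n_nodes)]
    moved = 0
    for src, node in enumerate(nodes):
        for key, value in node:
            dst = sum(key.encode()) % n_nodes  # simple deterministic hash
            if dst != src:
                moved += 1
            regrouped[dst].append((key, value))
    return regrouped, moved

def reduce_phase(nodes):
    """Sum amounts per customer; each key now lives on exactly one node."""
    totals = {}
    for node in nodes:
        local = defaultdict(float)
        for key, value in node:
            local[key] += value
        totals.update(local)
    return totals

cleaned = map_phase(partition(records, 2))
shuffled, moved = shuffle(cleaned, 2)
totals = reduce_phase(shuffled)
print(totals, "records moved between nodes:", moved)
```

The cleansing step never looks beyond a single record, so it parallelizes trivially. The per-customer total cannot be finished until all of a customer's records sit on the same node, and a job that chains several such regroupings pays that redistribution cost each time.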

De Montcheuil concedes that the idea of MapReduce-powered ETL has its skeptics. Nevertheless, he argues, Hadoop's parallelism can address most performance concerns. More to the point, de Montcheuil maintains, Hadoop is emerging as a central site for DI in the enterprise.

One upshot of this is that the open source software (OSS) community and vendors such as Talend are actively working to address Hadoop's and MapReduce's shortcomings. The Hadoop/MapReduce stack is going to get better, de Montcheuil promises, and -- as it improves -- it's going to become ever more central to information management. Even 16 years ago, after all, the OSS Linux operating system (which likewise comprised a combined stack -- the Linux kernel itself, plus the more mature OSS GNU software stack) was by no means a match for mature Unix platforms. This was true for more than a decade after its birth. However, Linux got better in a big way on the strength of both community- and vendor-initiated contributions.

"I know that there are criticisms, [but] it sometimes is good to have a fresh approach and not to be encumbered with third-, fourth-, or fifth-normal form. Those debates [regarding database normalization] have gone on for years," he notes.

"We don't view Hadoop as another silo, and certainly not as a new island of data unique [to itself]," de Montcheuil continues. "It's part of the mainland or archipelago of systems that are deeply integrated. That's exactly what data integration is doing. ... You build bridges or levees between Hadoop and the other applications."

Talend's approach differs from that of competitors such as Syncsort or Pervasive. As part of its DMExpress Hadoop Edition, for example, Syncsort opts to run an instance of its ETL engine on each Hadoop node in place of MapReduce; Syncsort developed a plug-in or library that it says enables Hadoop to make calls to instances of DMExpress distributed across Hadoop nodes, says Jorge Lopez, senior manager for data integration with Syncsort. It opted for this approach, Lopez maintains, because DMExpress is more highly optimized for ETL and DI than is MapReduce.

Pervasive doesn't propose to use MapReduce, either. Dave Inbar, senior director of big data products with Pervasive, famously described Hadoop as "a beautiful platform for all kinds of computation." Like Syncsort's DMExpress, Pervasive's DataRush DI engine can be deployed either outside Hadoop -- pulling information into DataRush from HDFS -- or across Hadoop, i.e., as a MapReduce replacement. In an upcoming version of Hadoop, Inbar said, "DataRush just becomes another compute-paradigm engine that's managed and visible to everything else in Hadoop, but it gives you [DataRush's] pipelining and parallelism benefits inside the Hadoop infrastructure."

Making Hadoop Safe for DM?

Talend's Eclipse-based Open Studio design environment translates between the lingua franca of the Hadoop world -- with its array of cryptic and unfamiliar tongues (e.g., "Pig" or "HQL") -- and the more familiar language of SQL.

In this regard, Talend proposes to address what Third Nature's Madsen identifies as Hadoop's biggest drawback: its lack of usable, intelligible, data management-oriented programming or management tools. In Hadoop, says Madsen, "it's incumbent on the programmer to code it right, which is an advantage of SQL and [database] data flow architecture. Pig script is an attempt at declarative programming over H[adoop], but with the flaw that there's no [logic] behind it, unlike SQL; therefore, optimization is hard."

Back in May, Talend introduced Talend Open Studio for Big Data, version 5.1. The revamped Open Studio handles most of the translation between the SQL-centric realm of DI and the more programmer-oriented tools of the Hadoop stack, says de Montcheuil.

"Cloudera and HortonWorks have done a terrific job of bringing stable, enterprise-ready versions of Hadoop to the marketplace ... but there's one thing that remains, which is very complex. You still have to learn Pig, Hive SQL, [and] MapReduce," he points out, adding that -- as of version 5.1 of Talend Open Studio -- Talend is able to abstract most of the complexity associated with the underlying Hadoop ecosystem.

In other words, ETL architects, DBAs, and other data management professionals don't need to understand MapReduce, Pig, HQL, or other unfamiliar technologies to design DI jobs. Open Studio handles the translation for them.
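To make the idea of translation concrete, here is a minimal sketch of how one declarative job description might be rendered as either a Hive query or a Pig script. It is not Talend's generator and does not reflect Open Studio's internals; the FilterJob class, its fields, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FilterJob:
    """Hypothetical DI job: copy selected columns for rows above a threshold."""
    source: str        # Hive table name, or an HDFS path when rendered as Pig
    target: str
    columns: List[str]
    predicate_column: str
    min_value: int

def to_hiveql(job: FilterJob) -> str:
    """Render the job as a HiveQL INSERT ... SELECT statement."""
    cols = ", ".join(job.columns)
    return (
        f"INSERT OVERWRITE TABLE {job.target}\n"
        f"SELECT {cols}\n"
        f"FROM {job.source}\n"
        f"WHERE {job.predicate_column} >= {job.min_value};"
    )

def to_pig(job: FilterJob) -> str:
    """Render the same job as a Pig Latin LOAD / FILTER / STORE script."""
    cols = ", ".join(job.columns)
    return "\n".join([
        f"src = LOAD '{job.source}' USING PigStorage(',') AS ({cols});",
        f"kept = FILTER src BY {job.predicate_column} >= {job.min_value};",
        f"STORE kept INTO '{job.target}' USING PigStorage(',');",
    ])

# Assumed example job; the designer never writes Hive or Pig by hand.
job = FilterJob(source="raw_orders", target="clean_orders",
                columns=["order_id", "customer", "amount"],
                predicate_column="amount", min_value=100)

print(to_hiveql(job))
print()
print(to_pig(job))
```

The person designing the job fills in the description once, and the tool decides which target language to emit. A real generator would also have to handle data types, joins, and error handling, none of which this sketch attempts.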

"Today we have all of the Hadoop ecosystem components covered," de Montcheuil concludes, citing Talend's support for "Oozie," Hadoop's new workflow and scheduling facility. (Before Oozie, Hadoop jobs were typically scheduled with the Unix cron utility.)
