A Closer Look: Talend's Big Data Sandbox
By Stephen Swoyer
September 30, 2014
You can lead someone to the big data lake, but you can't make them take the plunge.
You can give them a helpful push, however. That's just what open source data integration (DI) specialist Talend says it's trying to do with its new Big Data Sandbox, which it positions as a "virtual data integration environment" for big data use cases.
By "virtual," Talend means just that: its Big Data Sandbox comes as a pre-configured virtual machine environment, complete with a commercial Hadoop distribution -- Talend offers the Big Three flavors: Cloudera, Hortonworks, and MapR -- Hive, Pig, Talend's Platform for Big Data, and other requirements. Presto, says Yves de Montcheuil, vice president of marketing with Talend: an instant proof of concept. Sort of.
The "sort of" part comes by way of Talend's "Big Data Insights" cookbook, a how-to guide (with working examples and video tutorials) that addresses four common big data use cases: ETL offloading, clickstream analysis, social media sentiment analysis (specifically, Twitter) and weblog analysis. Users can download the Sandbox, watch the tutorials, and play with the examples to determine which (if any) big data use cases might be of value.
"Getting started with big data is still not easy for people. They are clearly getting value from it once they get into it, but getting into it is perceived as a problem. You need to know what you're doing, you need to download a Hadoop distribution, you need to configure your environment and dependencies. Even with Cloudera, Hortonworks, and MapR, it still takes some work to configure and install it," de Montcheuil says.
"What we are doing is making it possible for you to download a preconfigured [virtual machine] that will run on the common VM platforms. You will literally be able to download, install, and get this [Big Data Sandbox] up and running in less than 10 minutes," he continues.
Talend's Big Data Sandbox ships with a trial version of Talend's Platform for Big Data, which is offered with a 30-day license. It's possible to build a working proof-of-concept in the trial version and move it (along with all collateral) to a production implementation, de Montcheuil explains, although he concedes that this wouldn't exactly be a turnkey proposition.
"Technically, it's a full-featured product," says de Montcheuil, "so if you decide to license it, everything that you have [built in it] is reusable. If you don't want to license it, you don't need to reinstall anything [once the trial ends]; it's just a matter of reinstalling the license key."
ETL and the Future of Hadoop
In some ways, Talend and its competitors have already harvested Hadoop's lowest-hanging fruit -- its use and applicability as a platform for inexpensive distributed data storage and parallel data processing. Of the four use cases described in Talend's "Big Data Insights" cookbook, ETL offloading is by far the most popular, de Montcheuil acknowledges. "Many" Talend customers are dabbling with clickstream analysis and machine/sensor log analysis, he says -- although fewer are using either in production. A few customers are using social sentiment analysis to buttress other efforts, such as a 360-degree view of the customer.
However, ETL offload has (by a wide margin) the most traction, de Montcheuil reiterates.
The reason: not only is ETL offload well supported by third-party products but -- quite aside from the allure of Hadoop's cheap parallelism -- more data is being landed and stored in the Hadoop environment. If it's already there (and at significant volumes, too), it makes more sense to process it in place. ETL offload in the context of Hadoop, then, could be said to take the form of something like TEL: i.e., data transformation is actually pushed down to or scheduled on Hadoop, typically as one or more MapReduce jobs. The goal is to produce smaller working data sets, which can then be extracted from Hadoop and loaded into one or more target platforms.
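In code, that transform-in-place step is typically just a conventional MapReduce job. The sketch below is a minimal, generic example -- not Talend-generated code -- that assumes weblog lines whose first whitespace-delimited field begins with a yyyy-MM-dd date; it condenses raw hits into per-day counts small enough to extract and load into a warehouse or mart.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Rolls raw weblog lines up into per-day hit counts -- the "T" of the TEL pattern. */
public class DailyHitRollup {

    public static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text day = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumes the first whitespace-delimited field starts with a yyyy-MM-dd timestamp.
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                day.set(fields[0].substring(0, Math.min(10, fields[0].length())));
                ctx.write(day, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text day, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            // The reduced output is a much smaller working set, ready to extract and load elsewhere.
            ctx.write(day, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily-hit-rollup");
        job.setJarByClass(DailyHitRollup.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, a job like this would be launched with something like hadoop jar rollup.jar DailyHitRollup /logs/raw /logs/daily, with the cluster handling the parallelism.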
Talend's Platform for Big Data is what's called a parallel-ETL-in-Hadoop offering. It generates ETL jobs and schedules them to be parallelized across a Hadoop cluster, where they're in turn processed (by Hadoop's resource manager) as a series of map and reduce operations. Talend, along with Syncsort Inc. and the former Pervasive Software (now owned by Actian), was among the first players to offer a parallel-ETL-in-Hadoop capability. (Technically, Pervasive and Syncsort were first to market with parallel-ETL-on-Hadoop, beating their competitors -- including Talend -- by more than a year.)
The key to any ETL-on-Hadoop technology is abstraction: programming for Hadoop is hard, and commercial third-party tools have concentrated on making it easier -- or have tried to eliminate it altogether. For example, Hadoop has a primitive SQL-like interface in Hive Query Language (HiveQL), along with an interpreter, Hive, that compiles HiveQL queries into MapReduce jobs. Even though alternatives such as Pig are more efficient and/or scalable than Hive, they tend to entail language-specific and domain-specific expertise (such as knowledge of the kinds of map-and-reduce operations used in ETL).
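The appeal of that abstraction is easiest to see from the calling side. The sketch below -- a generic example, not drawn from Talend's product -- submits a HiveQL query to a HiveServer2 instance over JDBC; the host and table names are hypothetical, and Hive quietly turns the query into one or more MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Submits a HiveQL query via HiveServer2; Hive compiles it into MapReduce jobs on the cluster. */
public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical connection string and table name -- adjust to the cluster at hand.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

The caller writes familiar SQL-like syntax; the map-and-reduce plumbing never surfaces.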
It's in this respect that Talend's DI model aligns neatly with Hadoop's execution model: the Talend Platform for Big Data generates data integration jobs in the form of Java, Pig Latin, or HiveQL executables, which it then schedules to run on Hadoop. "It gives you the ability to natively run the transformations inside Hadoop by generating MapReduce code, Pig Latin, HiveQL, and more recently, Spark," explains de Montcheuil, referring to Spark, a scalable, interactive compute engine that can run on top of Hadoop.
In addition to ETL, the Platform for Big Data also includes Talend's data quality technology, he says. "You can actually cleanse your data, you can augment your data, you can run matching data deduplication inside Hadoop. You can very quickly bring your data into the sandbox using the [tools included in the] Big Data Sandbox."
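Matching and deduplication fit the same map-and-reduce mold as transformation. As a loose illustration -- again, not Talend's actual implementation -- the job below groups records on a normalized key (here, a hypothetical e-mail field) and keeps one record per group; real data quality tooling would score candidate matches and apply survivorship rules rather than simply taking the first record.

```java
import java.io.IOException;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Exact-match deduplication inside Hadoop: group records on a normalized key, keep one per group. */
public class ExactDedup {

    public static class KeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text dedupKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical layout: e-mail address in the first comma-separated field.
            String email = record.toString().split(",")[0].trim().toLowerCase(Locale.ROOT);
            dedupKey.set(email);
            ctx.write(dedupKey, record);
        }
    }

    public static class FirstWinsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            // Emit only the first record seen for each key.
            for (Text record : records) {
                ctx.write(key, record);
                break;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "exact-dedup");
        job.setJarByClass(ExactDedup.class);
        job.setMapperClass(KeyMapper.class);
        job.setReducerClass(FirstWinsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```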
"ETL offload is the biggest use case of Hadoop. It's the first thing [most adopters] do."
With Spark, which supports asynchronous and interactive workloads, de Montcheuil anticipates even more DI-on-Hadoop uptake. Spark leverages another open source project -- Tachyon -- as a distributed in-memory data store, but it's also able to coexist with HDFS; in addition, it has its own SQL interpreter, Shark, which (although considerably more promising than Hive) faces an uncertain future. For these and other reasons, Spark has lately been suggested as an alternative and/or replacement for Hadoop's MapReduce compute engine.
This is doubtless premature; however, an ability to support interactive, asynchronous workloads makes Spark better suited than Hadoop's native MapReduce engine to general-purpose analytic and data integration workloads. At the very least, de Montcheuil argues, it gives adopters a compelling one-two punch.
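From a developer's perspective, the roll-up shown earlier collapses into a few lines under Spark's API. The sketch below is a generic job written against Spark's Java API (the log layout and the position of the URL field are assumptions); it reads weblog lines from HDFS, counts hits per page, and writes a much smaller result set back out.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

/** Counts hits per requested URL from weblog lines in HDFS; runnable on Spark standalone or YARN. */
public class SparkHitCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-hit-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);             // e.g. hdfs:///logs/weblog
        JavaPairRDD<String, Long> hits = lines
                .map(line -> line.split("\\s+"))
                .filter(fields -> fields.length > 6)              // assumes a combined-log-style layout
                .mapToPair(fields -> new Tuple2<>(fields[6], 1L)) // hypothetical: field 7 holds the URL
                .reduceByKey((a, b) -> a + b);

        hits.saveAsTextFile(args[1]);                             // smaller working set, ready to extract and load
        sc.stop();
    }
}
```

Submitted with spark-submit --master yarn, the same code runs unchanged whether the engine underneath is a test VM or a production cluster, which is part of Spark's appeal for DI work.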
"Right now, we have experimental support for Spark. We did it with a partner, a system integrator partner, ... and we have some customers who are using it right now on experimental support. They're finding it great. They say that it runs up to 100 times faster than MapReduce," he says, noting that another recent innovation -- YARN, or yet another resource negotiator -- permits Hadoop to more effectively run and manage Spark and other compute engines, in addition to MapReduce.
The upshot, de Montcheuil says, is that Hadoop is slowly but surely morphing into a credible platform for mixed workloads. "YARN enables Hadoop's transition from a batch processing platform into a real-time data system. In future, you will have multiple processing engines on Hadoop for different needs in the same way that [today] you have batch ETL and you have ESBs. You don't use an ESB to load your nightly batch into your warehouse, but you do use an ESB for real-time [workloads]. So [it is] with MapReduce and Spark. The future is to have mixed workloads. When you need a real-time response, you will use Spark; when you need to do something like build[ing] profiles of your customers, you will use MapReduce."