Hadoop, Cheap Storage, and Parallel Processing
Casting Hadoop as a platform for cheap storage -- e.g., a Hadoop-based "data lake" -- only gets at half of what makes Hadoop new, compelling, and uniquely valuable.
- By Stephen Swoyer
- May 26, 2015
Never mind the hype, says Jamie Keeffe, a product marketing manager with master data management (MDM) specialist RedPoint Global Inc. Casting Hadoop as a platform for cheap storage -- e.g., the increasingly ubiquitous Hadoop-based "data lake" -- only gets at half of what makes Hadoop new, compelling, and (from a customer perspective) uniquely valuable.
Remember, Keeffe points out, Hadoop combines a scalable distributed storage layer -- HDFS, or the Hadoop Distributed File System -- with a baked-in, general-purpose parallel processing layer.
This combination is new if not exactly unprecedented. The massively parallel processing, or MPP, database boasts something similar -- with one (critical) caveat: an MPP database isn't in any sense a "general-purpose" parallel processing environment. It's optimized specifically for query processing.
These days, Hadoop can do it all, says Keeffe, who claims he's frustrated with the many ways in which Hadoop has been co-opted by self-serving vendors.
"The way most [master data management or data quality vendors] treat Hadoop, it's effectively relegated to the role of just cheap storage. They're designed to move large volumes of data out of Hadoop across the wire and into a traditional MDM [hub] or data quality [engine] for processing. Adding insult to injury, the data [that's processed] in the traditional MDM [hub] is then moved back into Hadoop -- and it's no richer for the journey."
Keeffe arguably has a self-serving interest -- his company, RedPoint, markets its own Data Management Platform for Hadoop, after all -- but he does make a good point. Because Hadoop consolidates storage and compute, it's possible to bring processing to the data -- instead of consolidating data at a central site and processing it there.
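To make "bringing processing to the data" a little more concrete, here is a minimal sketch in Python applied to a data quality task, duplicate matching. It is an illustration under stated assumptions, not RedPoint's method or a Hadoop API: the record fields, the `blocking_key` rule, and the 0.85 similarity threshold are all hypothetical, and a local thread pool merely stands in for the nodes of a cluster. The idea it shows is that records are partitioned once, each partition is matched where it sits, and only the small list of matches travels back.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def blocking_key(record):
    """Hypothetical partitioning rule: surname initial plus ZIP code."""
    return (record["surname"][:1].upper(), record["zip"])


def match_block(block, threshold=0.85):
    """Compare every pair inside one partition and return likely duplicates."""
    matches = []
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            a, b = block[i], block[j]
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


def find_duplicates(records, workers=4):
    """Partition the records, then match the partitions concurrently."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # Partitions are independent of one another, so each can be handled by
    # a different worker. The thread pool is only a stand-in for cluster
    # nodes that would each process the partition they already hold.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_block, blocks.values())

    # Only the (small) match pairs are gathered centrally.
    return sorted(pair for block_matches in results for pair in block_matches)


records = [
    {"id": 1, "name": "Jon Smith", "surname": "Smith", "zip": "02139"},
    {"id": 2, "name": "John Smith", "surname": "Smith", "zip": "02139"},
    {"id": 3, "name": "Jane Smythe", "surname": "Smythe", "zip": "02139"},
    {"id": 4, "name": "Ann Lee", "surname": "Lee", "zip": "94105"},
]
duplicates = find_duplicates(records)
print(duplicates)
```

The alternative the article describes, shipping every record to an external engine and back, would move the full data set over the wire twice. In this sketch only the identifiers of matched pairs leave the partitions.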
In this scheme, which is no less attractive because of its low cost (Hadoop is a comparatively inexpensive platform for both distributed storage and parallel processing), Hadoop becomes the de facto storage repository for relevant business information -- or, in some implementations, for all information that's generated or collected by an organization.
This raises a question that's by no means specific to Keeffe and RedPoint: if an organization has already invested in Hadoop, why shouldn't it take advantage of the Hadoop platform's cheap, scalable storage? Why shouldn't an enterprise use Hadoop as a context in which to stage and prepare data, as well as to cleanse and standardize it?
This is precisely Keeffe's point.
"For data management, this [self-serving] approach creates some serious opportunity costs. When you're doing such things as identifying, matching, or linking with highly configurable rules, [those kinds of workloads] can significantly benefit from the ability to split off a job and process it across Hadoop nodes or even [across] the [SMP] cores within those individual nodes," he points out.
"However, if you move [the data] out of Hadoop for processing in an external MDM [hub or data quality engine], you lose that ability, and [parallel processing for MDM and data quality] is an obvious business application for Hadoop. If you're doing MDM, you can process data as you load it, or shortly [there]after. The kinds of [operations] you do in MDM -- [for example,] identifying and matching -- can benefit from Hadoop's parallelism. [Hadoop] jobs can be multi-threaded so that they split themselves up to run across the different nodes to take advantage of available compute resources. If [a job in Hadoop is] YARN-certified, you can have really fine-grained control. You can have [jobs] that would take potentially hours to run in an external MDM [hub] completing in just minutes."
By YARN, Keeffe means the new resource manager (YARN is actually a backronym for "Yet Another Resource Negotiator") that debuted 18 months ago with version 2.0 of Hadoop. YARN is the culmination of a massive overhaul of Hadoop's baked-in parallel computing architecture. Unlike Hadoop v1.x -- which was tightly coupled to a batch-only implementation of the MapReduce engine -- Hadoop 2.x and YARN now support interactive and query workloads (via Apache Tez) and real-time data processing (via Apache Slider), along with, of course, brute-force batch workloads (via legacy MapReduce).
More important, YARN makes it possible for third-party engines, such as Apache Spark, to run as full-fledged citizens -- complete with granular resource management -- in a Hadoop cluster. (It's always been possible to run third-party engines in Apache Hadoop, but -- prior to YARN, and absent the use of distribution-specific or proprietary management tooling -- it wasn't possible to manage or, more precisely, to allocate compute resources for non-MapReduce jobs.)
In other words, versions of Hadoop prior to 2.0 were tightly coupled to MapReduce, such that it wasn't possible to schedule and parallelize -- with anything approaching granularity -- non-MapReduce workloads in the Hadoop environment. YARN decoupled Hadoop from this dependence.
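What "allocating compute resources" for an arbitrary workload means can be sketched with a toy model. The Python below is not YARN's API, and every class and field name in it is hypothetical. It only mimics the general shape of the arrangement: an application, whatever engine it runs, asks a cluster-wide resource manager for a number of containers of a given size (memory and virtual cores), and the manager grants as many as the nodes can still hold. In YARN itself, that request is made by a per-application coordinator called an ApplicationMaster.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class ContainerRequest:
    app: str        # any kind of workload: batch, interactive, streaming
    memory_mb: int  # memory per container
    vcores: int     # virtual cores per container
    count: int      # how many containers the application wants


class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, request):
        """Grant up to request.count containers; return the hosting node names."""
        granted = []
        for _ in range(request.count):
            # First fit: take the first node with enough free capacity.
            node = next(
                (n for n in self.nodes
                 if n.free_memory_mb >= request.memory_mb
                 and n.free_vcores >= request.vcores),
                None,
            )
            if node is None:
                break  # nothing fits; the remaining containers are not granted
            node.free_memory_mb -= request.memory_mb
            node.free_vcores -= request.vcores
            granted.append(node.name)
        return granted


cluster = ToyResourceManager([
    Node("node-1", free_memory_mb=8192, free_vcores=4),
    Node("node-2", free_memory_mb=4096, free_vcores=2),
])
batch = cluster.allocate(ContainerRequest("batch-job", 2048, 1, count=3))
matching = cluster.allocate(ContainerRequest("matching-job", 4096, 2, count=2))
print(batch, matching)
```

Here the batch job gets all three of its containers on node-1, while the matching job gets only one of the two it asked for, because nothing else in the toy cluster has 4 GB and two cores free. A real scheduler adds queues, priorities, and locality preferences that this sketch leaves out.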
Keeffe could be said to have a self-serving interest in playing up this aspect of RedPoint's Hadoop integration, however. After all, RedPoint claims that its Data Management Platform for Hadoop is a "native" YARN application. If you're shrugging your shoulders thinking "Big deal," Keeffe begs to differ. There's a world of difference, he argues, between "YARN-ready" software and applications -- such as RedPoint Data Management for Hadoop -- that are YARN-native.
For example, an application that uses Hive -- a SQL-like interpreter for Hadoop that compiles Hive Query Language (HiveQL) queries into MapReduce jobs -- to query Hadoop data, or to get data into and out of Hadoop, qualifies as "YARN-ready." However, a YARN-ready application doesn't have fine-grained control over scheduling, resource use, parallelization, and other aspects of Hadoop performance.
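For readers who haven't seen what "compiling a query into a MapReduce job" amounts to, the following plain-Python sketch walks one simple aggregate query through the three classic phases. It is a simplification for illustration, not the plan Hive actually generates. The `customers` rows and the function names are made up for the example.

```python
from collections import defaultdict

# The query being illustrated (HiveQL):
#   SELECT country, COUNT(*) FROM customers GROUP BY country;


def map_phase(rows):
    """Mapper: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


customers = [
    {"name": "Ann", "country": "US"},
    {"name": "Bernd", "country": "DE"},
    {"name": "Carla", "country": "US"},
]
counts = reduce_phase(shuffle_phase(map_phase(customers)))
print(counts)
```

The application that submits the query decides what to ask for. How the phases are scheduled, how many tasks run, and where they run are all settled further down the stack, which is the limitation of "YARN-ready" software described above.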
"The number one thing is that [running as a native YARN application] eliminates 100 percent of the programming code that runs on Hadoop. If your application can speak native YARN, you can write that ApplicationMaster and YARN will instantiate all of the core engines [required by] that ApplicationMaster to run in Hadoop the same way that MapReduce does."
Keeffe's citation of an ApplicationMaster is borderline technical, but -- to vastly over-simplify what's involved -- think of an ApplicationMaster as kind of like YARN's DNA. It's an encoded, highly detailed template for executing workloads in parallel. (DNA is itself an encoded template for protein synthesis, which is the engine of cell regeneration and growth.)
Keeffe's point is that writing to (or using a YARN-native app such as RedPoint to automatically generate) an ApplicationMaster eliminates the need for highly specialized, domain-specific coding skills (e.g., coding data engineering-specific transformations -- such as directed acyclic graphs -- in Java or Pig) and likewise simplifies the process of scheduling workloads to run in optimized engines such as Tez or Slider.
"Without an ApplicationMaster, the point is that it's not a native YARN application, so there will be coding that needs to be done -- either coding that's generated by your apps and processes in Hadoop or coding that you have to write yourself in Python, Scala, or other [languages]."