MapReduce, Parallel Processing, and Pipelining: A Tech Primer
Over the last five years, many DBMS vendors have introduced native or in-database implementations of MapReduce, a popular parallel programming model for distributed computing. One key difference is that MapReduce in the data management world tends to speak a very different programming language. There are plenty of other differences, too.
By Stephen Swoyer
Analytic database platforms first implemented support for MapReduce, a parallel programming model for distributed computing, half a decade ago.
In the open source world of the Apache Software Foundation's Hadoop framework, programmers code MapReduce jobs in Java, Pig Latin, Perl, or Python.
Collectively, these languages constitute a kind of lingua MapReduce. This makes sense, given the programmer-centric pedigree of Hadoop MapReduce.
Things are understandably different in the data management (DM) world, where Java -- to say nothing of Pig Latin, Perl, or Python -- is far from a lingua franca.
Back in September of 2008, two analytic database specialists -- the former Aster Data Systems Inc. and the former Greenplum Software Inc. -- announced native support for MapReduce. The in-database MapReduce implementations touted by both Aster and Greenplum focused on pairing MPP (massively parallel processing, used for relational workloads) with MapReduce (for basically everything else). At the time, Hadoop was just a few years old. It was being touted as a general-purpose parallel processing platform -- much like MPP. Part of its claim (for general-purpose use) had to do with the fact that developers could write MapReduce jobs in popular languages such as Java or Perl without having to know SQL.
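To make the programming model concrete, here is a minimal single-process sketch of a word-count job in Python, one of the languages mentioned above. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the "cluster" is simulated sequentially in one process. It only illustrates the shape of a job that a developer could write without knowing SQL.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    # On a real cluster the map and reduce calls would run on many nodes;
    # here they run sequentially in one process, purely for illustration.
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["map reduce map", "reduce shuffle map"])
print(counts)  # {'map': 3, 'reduce': 2, 'shuffle': 1}
```

The developer supplies only the map and reduce logic; grouping by key and distributing the work are the framework's job.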
In the DM world, MPP is used to power highly scalable data warehouse systems. It's old hat. Application developers, by contrast, typically don't work with SQL or -- for that matter -- relational theory. From a programming perspective, Hadoop MapReduce was revolutionary.
From a DM perspective, it was less so. Nevertheless, over the last five years, most other analytic DBMS players have introduced in-database support for MapReduce. One key difference is that MapReduce in the DM world tends to be spoken with a SQL-savvy patois. In other words, MapReduce jobs aren't just coded in Java or C++: instead, they're a mix of these procedural languages and SQL. Teradata Aster, for example, touts what it calls "SQL/MapReduce," a marriage of SQL and MapReduce that permits developers to write UDFs (user-defined functions) in supported programming languages and insert them into traditional SQL queries.
The beauty of this scheme is that the Aster MPP engine handles SQL queries and uses its in-database implementation of MapReduce to parallelize non-relational tasks. Aster likewise offers several dozen canned SQL/MapReduce UDFs for advanced analytic workloads. The allure of in-database MapReduce -- whether it's implemented by Aster (now owned by Teradata Corp.), Greenplum (now owned by EMC Corp.), Netezza (now owned by IBM Corp.), ParAccel Inc., or many other vendors -- is that technologists can typically use languages other than Java, Perl, or Python. They can use C, C++, or even C#, as well as SQL.
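The general idea of writing a UDF in a procedural language and calling it from an ordinary SQL query can be sketched with Python's standard-library `sqlite3` module. This is only an analogy: it is not Aster's SQL/MapReduce syntax or API, SQLite runs in a single process with no parallelism, and the `signups` table and `domain_of` function are invented for illustration.

```python
import sqlite3

def domain_of(email):
    """Hypothetical UDF: a per-row procedural step written outside SQL."""
    return email.rsplit("@", 1)[-1].lower()

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE signups (email TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?)",
    [("ann@Example.com",), ("bob@example.com",), ("cy@sample.org",)],
)

# The SQL engine handles grouping and ordering; the UDF handles the
# procedural logic applied to each row.
rows = conn.execute(
    "SELECT domain_of(email) AS domain, COUNT(*) AS n "
    "FROM signups GROUP BY domain_of(email) ORDER BY domain"
).fetchall()
print(rows)  # [('example.com', 2), ('sample.org', 1)]
```

In an MPP database the same division of labor would apply, with the engine also deciding how to spread the UDF's work across nodes.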
There are other important differences, too. Hadoop MapReduce is a sequential, synchronous beast; right now, it doesn't support pipelining, which isn't just a commonplace programming feature but is a necessity for many kinds of data management workloads -- particularly in an MPP context.
As a result, most in-database MapReduce implementations also implement a pipelining capability of some kind. In the data warehousing world, pipelining is important because it enables asynchronous operations. Asynchronicity is critical in an MPP architecture, argues industry veteran Mark Madsen, a principal with consultancy Third Nature Inc. "[O]therwise, you have to write out 100 percent of your output between each step in a process, and you are gated by the slowest node [in a cluster], therefore [you get] data-skew effects. Most of MPP database work is about [figuring out an] optimal data distribution to avoid skew," he points out. Lack of pipelining "also means that you can't process just-in-time," because this isn't amenable to the MapReduce model, Madsen explains.
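The difference Madsen describes can be sketched in a few lines of Python. The example below is a single-process toy, not any vendor's implementation, and the names (`read_records`, `transform`, `staged`, `pipelined`) are hypothetical. The staged version writes out all of one step's output before the next step starts; the pipelined version hands each record to the next step as soon as it is produced.

```python
def read_records(n, log):
    """Hypothetical source: yields n records and logs each read."""
    for i in range(n):
        log.append(f"read {i}")
        yield i

def transform(records, log):
    """One processing step, written as a generator so it can stream."""
    for r in records:
        log.append(f"transform {r}")
        yield r * r

def staged(n):
    """No pipelining: materialize each step's full output before the next."""
    log = []
    step1 = list(read_records(n, log))    # all records read and stored first
    step2 = list(transform(step1, log))   # starts only after step1 finishes
    return step2, log

def pipelined(n):
    """Pipelining: each record flows to the next step as soon as it exists."""
    log = []
    result = list(transform(read_records(n, log), log))
    return result, log

staged_result, staged_log = staged(3)
piped_result, piped_log = pipelined(3)
print(staged_log)  # all reads, then all transforms
print(piped_log)   # reads and transforms interleaved
```

Both versions return the same result; only the order of work differs. In the staged log, no transform appears until every read has finished, which is why a slow producer holds up everything downstream.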
ParAccel, for its part, implements a "MapReduce-lite" API that supports pipelining and other DM amenities. "We have a different API than the MapReduce API. It does the same thing. It's a massively parallel API ... and you can do ... the equivalent of a map phase and a reduce phase," explains Rick Glick, vice president of technology and architecture with ParAccel.
"We have kind of a 'MapReduce-lite' API. It's a bit more database-centric, it understands things like parallel pipelining a little bit better, and it understands data a little bit better. It's better at re-hashing the data and re-swiggling the data, [as well as] sorting and all of that stuff. It's a little bit more complete of an API than MapReduce."
From a DM perspective, then, Hadoop MapReduce's lack of pipelining is one of its biggest drawbacks. ParAccel's Glick sees this as a function of a Hadoop compute architecture in which HDFS (distributed storage) and MapReduce (distributed processing) are tightly coupled.
"HDFS is just the file system, but it was built with the notion of MapReduce in mind, so MapReduce would be the level where you'd do parallel pipelining, but that isn't possible," he explains, adding: "I'm not trying to be critical of the MapReduce API. I actually think it's a useful paradigm, and it really simplifies parallel programming for the masses, which I think is a useful thing, but -- like any [paradigm] -- it has its shortcomings."
In spite of its ubiquity, MapReduce is poised to get a lot less important.
The Apache YARN (Yet Another Resource Negotiator) project promises to decouple Hadoop from MapReduce. It will replace the Hadoop JobTracker, which currently performs two functions: managing cluster resources and distributing MapReduce jobs. YARN will help to democratize Hadoop by making it easier to parallelize non-MapReduce jobs.
To get an idea of what the post-YARN Hadoop compute model might look like, consider DataRush, an ETL technology from Actian Inc. subsidiary Pervasive. DataRush is a scale-out (massively parallel) and scale-up (large-scale SMP) ETL technology. It can be deployed in or out of Hadoop; in the former case, it will exploit Hadoop's distributed storage layer (HDFS) and use its own ETL logic in place of Hadoop's map and reduce functions.
"With DataRush, you don't have to think about [Hadoop] in terms of map and reduce concepts at all," claims David Inbar, senior director of big data products with Pervasive. "You decide what functions you're going to execute or need to execute -- whether it's reading, writing, joining, aggregating, clustering, executing algorithms, [or] adding your own: all of the parallelism is taken care of for you, all of the distribution is taken care of for you." YARN won't change this, he suggests.
"With YARN, we'll be plugging into the API that YARN provides so that DataRush workflows can be managed together with MapReduce and other workflows. We're not a 'foreign' job that's running across the cluster where potentially we're conflicting with a separate MapReduce job that's trying to consume."
Once YARN matures, Inbar believes that MapReduce's importance will diminish.
"MapReduce just happened to be the first coding construct that was made available in Hadoop, but it will be a legacy technology in a short time, I believe."