Analysis: What's All the Hadoop-la? EMC Breaks Out with Pivotal HD
EMC Corp. arguably stole the show at last month's Strata 2013 conference, announcing the equivalent of a SQL-compliant, high-performance RDBMS running on top of Hadoop: Pivotal HD.
- By Stephen Swoyer
- March 12, 2013
EMC bills Pivotal HD as "the world's most powerful" Hadoop distribution.
As claims go, that's quite bold. With Pivotal HD, the company says it's actually scaling its Greenplum massively parallel processing (MPP) database across the Hadoop Distributed File System (HDFS).
In other words, Greenplum is running on top of Hadoop.
This is EMC's long-incubating "Hawq" technology. Unlike several popular open source software (OSS) projects, Hawq isn't an RDBMS overlay for Hadoop. Instead, it's a full-fledged, SQL-compliant RDBMS running on top of HDFS. That's new and, arguably, unprecedented.
"This is a full Greenplum [MPP] database running on top of Hadoop. It is fully ACID-compliant. Basically, what this means is that customers can take the SQL-based tools that exist today that work for Greenplum and they get Hadoop for free," says Dave Menninger, head of business development and strategy at Greenplum. He says EMC ported Greenplum's data loaders, too. "We abstracted out the storage layer and the catalog layer ... and [Hawq] is the first application of that abstraction. You could see where that abstraction could lead to ... supporting other types of data stores as well."
Pivotal HD in Context
Integration between Greenplum and Hadoop is anything but new. The former Greenplum Software Inc. was one of the first vendors to cozy up to Apache Hadoop half a decade ago, when it announced tentative support for HDFS. Several months later, Greenplum unveiled a native, in-database implementation of MapReduce. In May of 2011, EMC -- which had purchased Greenplum the previous June -- announced the availability of EMC Greenplum Community Edition (CE) and EMC Greenplum HD Enterprise Edition. EMC's Greenplum CE was based on the open source Apache Hadoop stack; its HD Enterprise Edition variant used MapR Inc.'s proprietary distribution of Hadoop.
EMC's Hawq technology raises an obvious question, however: doesn't Hadoop already have a SQL-like query technology? Isn't that the purpose of Hive and HQL?
Sort of and yes. At the TDWI World Conference last month in Las Vegas, several data warehouse vendors and at least one Hadoop vendor -- Hortonworks Inc. -- spent time talking about Hive's many problems. According to a principal with a prominent data warehousing software vendor (who spoke on condition of anonymity and whose employer supports Hive and HQL in its DW software tools), Hive, HQL, and HCatalog (the latter an attempt to retrofit a kind of metadata layer on top of Hive) are plagued by poor performance and inconsistency.
"We can go out and grab something from Hive; we can run some SQL in there -- [and] we can have no idea that we've actually missed half the stuff that we should have got," this person said, adding, "There's no schema enforcement in Hive or HCatalog."
Jim Walker, director of product marketing with Hadoop specialist Hortonworks Inc., took an opposite tack, choosing instead to promote several ongoing initiatives designed to help improve Hive's performance. One of these, he says, is Hortonworks' new "Stinger" interactive query feature for Hive, which the company announced at the TDWI conference.
"The idea with Stinger is to improve Hive to become more interactive," he says. "We've implemented a number of different things to speed up Hive. What we're seeing is a 100x increase in performance."
EMC claims a much bigger performance boost. (See below.) Nevertheless, Menninger believes there's still a role for Hive in Pivotal HD environments. It just depends on the priorities of the customer, he says. "We do think that there's a Hadoop community that doesn't care to speak SQL, just as there's a SQL world that doesn't know too much about the Hadoop world," he comments. Customers have the option of licensing the Hawq SQL component, Menninger points out -- and some likely will decide not to do so. "In both of those cases, Hive will still be relevant; however, if you look at it as on a percentage-of-users basis, the larger percentage of users will probably utilize SQL exclusively."
Hawq Explained
What does Hawq do that Hive doesn't? For one thing, EMC says, it isn't an overlay, abstraction layer, or translation layer. Instead, it's the equivalent of an MPP RDBMS -- with column-store support -- running on top of HDFS. It likewise achieves that Holy Grail of Hadoop and NoSQL: ACID compliance.
Hawq, EMC claims, is the product of years of development; it's essentially a SQL-compliant data store that boasts full SQL-92 and SQL-99 compliance -- with support for SQL-2003's OLAP extensions. Hawq supports ODBC and JDBC connectivity; connectivity to other sources -- such as flat files stored in HDFS -- will be enabled by means of the Greenplum Extension Framework (GPXF).
Hawq also has native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
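To make the "SQL-2003 OLAP extensions" claim concrete, here is a minimal sketch of the kind of query those extensions allow: a windowed running total. It does not use Hawq or Pivotal HD. Python's built-in sqlite3 module stands in as the SQL engine (assuming a bundled SQLite of version 3.25 or later, which added window functions), and the `sales` table, its columns, and the `running_totals` helper are hypothetical names invented for this illustration.

```python
import sqlite3

# Hypothetical sample data: (region, month, amount). Not from the article.
SAMPLE = [("east", 1, 10), ("east", 2, 20), ("west", 1, 5), ("west", 2, 15)]


def running_totals(rows):
    """Return (region, month, amount, running_total) rows.

    Uses a SQL:2003-style window function. SQLite is only a stand-in
    engine here; this is not Hawq's API or connection method.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    query = """
        SELECT region, month, amount,
               SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
        FROM sales
        ORDER BY region, month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_totals(SAMPLE):
        print(row)
```

The `OVER (PARTITION BY ... ORDER BY ...)` clause is the OLAP-style construct: it computes a per-region cumulative sum without collapsing rows the way `GROUP BY` would. In principle, a SQL-compliant engine reached over ODBC or JDBC would accept the same query text.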
According to EMC, Pivotal HD and Hawq achieve a significant performance improvement over vanilla Hive.
Whereas Hortonworks claims a 100x increase in performance versus vanilla Hive, EMC (with Pivotal HD and Hawq) claims a 600x improvement. On the other hand, Hortonworks prides itself on its open source "purity." In discussing Cloudera's "Impala," a real-time query facility for Hadoop (which competes against Hortonworks' own "Stinger" technology), Walker stressed that the Cloudera technology is proprietary. So, too, is EMC's Hawq.
Given what's at stake, purity might not matter. Especially if Greenplum solutions architect Dr. Donald Miner's description of Pivotal HD and Hawq as "a near-real-time analytical SQL database that runs on Hadoop" pans out. (Miner makes this claim on his EMC blog.)
To the degree that EMC can deliver on Miner's vision -- or on something close to his vision -- it will have effectively exploded one of the Iron Laws of traditional data warehousing. After all, one of the most salient differences between big data analytics and traditional data warehouse-driven analytics practices has to do with real-timeliness, says industry luminary Claudia Imhoff, president of information management consultancy Intelligent Solutions Inc.
"The part [that big data has] really disrupted is the ability to do real-time analytics: we cannot in any way, shape, or form analyze real-time data in the data warehouse. It cannot be done," she says.
Imhoff was speaking about a traditional data warehouse in a traditional hub-and-spoke architecture. As she herself has observed, this architecture is typically being augmented by big data analytic solutions -- as well as by other disruptive technologies. Some vendors, such as Teradata Inc., even tout a kind of Grand Unification architecture. With its Unified Data Architecture (UDA), for example, Teradata claims to offer a single platform for decision support, ad hoc query and analysis, information discovery, advanced analytics, and big data analytics. EMC can't yet claim to address all of UDA's use cases. However, with Pivotal HD, it does claim to offer a single platform -- based on Hadoop -- for both traditional data warehousing and big data analytics.
In a follow-up article, we'll take a closer look at some of the other news items to come out of last month's Strata 2013 show -- including Intel Corp.'s long-awaited (and x64-optimized) Hadoop distribution and the Win64 version of Hadoop that both Microsoft Corp. and Hortonworks unveiled.