Pivotal's Hadoop-based Data Management Stack Coming Rapidly into Focus
- By Stephen Swoyer
- January 7, 2014
Early in 2013, EMC Corp. spin-off Pivotal started shipping Hawq, an implementation of its Greenplum massively parallel processing (MPP) DBMS for Hadoop. Pivotal positioned Hawq as the centerpiece of Pivotal HD, its proprietary Hadoop distribution.
At last year's Strata + Hadoop World conference, Pivotal flanked Hawq with two complementary offerings: Pivotal Data Dispatch (which it bills as a data discovery facility for Hadoop) and GemFire XD (an in-memory database technology first acquired by VMware Inc.).
Industry veteran Dave Menninger, head of business development and strategy with EMC Greenplum, describes GemFire XD as an in-memory database cache for Hawq, which -- like the Greenplum MPP database (and other MPP engines, for that matter) -- isn't a real-time database.
"Think of [GemFire] XD as in-memory database cache and Hadoop as the persistence for that information," Menninger explained. "We have capital markets that use this technology [because] it's for very low-latency rapid ingestion and analysis of information. The other key market besides financial services communications -- telcos, for example -- want to be able to capture streaming, real-time events, [such as] telemetry information, market data, and communications network information."
Data Dispatch, on the other hand, is designed to support a Hadoop "landing zone" use case. This describes a scenario in which an organization uses Hadoop as a landing, consolidation, and staging area for enterprise information. "This is technology for managing Hadoop-based data landing zones or data lakes. Organizations are inserting Hadoop as a collection point for all of the data that's being generated in their organizations. From that data landing zone or data lake, they're populating their data marts or data warehouses, or they're creating sandboxes, because you can typically collect that information in its raw form," he explained.
"We see the concept of a data landing zone emerging and Pivotal Data Dispatch is a tool for managing a collection of information into that landing zone and then the movement of that data through the data marts and data warehouses."
Data Dispatch also enables what Menninger called a "data lease" model: "You subscribe to or lease the information, and at the end of a lease, your access is terminated," he explained.
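The "data lease" idea can be illustrated with a minimal sketch. This is not Pivotal Data Dispatch's API, which the article does not describe; the DataLease class, the read_dataset function, and the dataset and subscriber names are all hypothetical, and the sketch assumes a lease is simply a grant time plus a fixed duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataLease:
    """Hypothetical lease record; not a Pivotal Data Dispatch API."""
    subscriber: str
    dataset: str
    granted_at: datetime
    duration: timedelta

    def is_active(self, now: datetime) -> bool:
        # Access is valid from the grant time up to, but not including, expiry.
        return self.granted_at <= now < self.granted_at + self.duration


def read_dataset(lease: DataLease, now: datetime) -> str:
    # Refuse access once the lease has lapsed (or before it starts).
    if not lease.is_active(now):
        raise PermissionError(f"lease on {lease.dataset} is not active")
    # Placeholder for whatever the subscriber would actually receive.
    return f"rows from {lease.dataset}"
```

In this sketch, a subscriber holding a 30-day lease can read the data set on day 1, while the same call on day 31 raises PermissionError, mirroring the "at the end of a lease, your access is terminated" behavior Menninger described.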
Pivotal's DM Strategy Comes into Focus
Hawq, Data Dispatch, and GemFire XD are a few of several DM-related products Pivotal delivered in 2013 for its Pivotal HD Hadoop distribution. (Another important deliverable is Spring XD, which Pivotal bills as an application development framework for big data.) At the same time, Pivotal has tweaked its messaging: when it first launched HD, for example, it sought to trumpet Hawq's massive performance advantage (up to 600x faster) with respect to Hive, the SQL-like interpreter for Hadoop that suffers from poor performance relative to MPP RDBMS engines.
Contextually, this made sense: months earlier, Pivotal competitor Cloudera Inc. had announced a new interactive query facility (Impala) for Hadoop, and -- at the same Strata 2013 event at which Pivotal announced Hawq -- Hortonworks Inc. unveiled "Stinger," its effort to improve both the performance and the flexibility of standard Apache Hive.
At Strata + Hadoop World, however, Menninger struck a more pragmatic chord, describing Hawq as (in effect) Hadoop made safe for data management (DM) practitioners.
He placed less emphasis on Hawq's performance -- which he dutifully described as "two orders of magnitude better than Hive" -- and much more on its DM feature set. Hive, Menninger claimed, currently lacks support for key SQL and RDBMS amenities -- including database transactions, subqueries, and materialized views.
By contrast, he pointed out, Hawq was arguably the first ACID-compliant, ANSI-standard SQL RDBMS implementation for Hadoop. Over the last 18 months, the open source software (OSS) Apache community and several commercial software vendors have invested a staggering amount of money and effort to help shore up Hadoop's DM feature set. In spite of this, Menninger claimed, Hawq remains the most SQL-savvy of Hadoop DBMS technologies.
This matters to enterprise IT organizations, he asserted. "If you're a more traditional enterprise, you have huge investments in SQL-based skills. Hawq is the best of both worlds. The data doesn't move. The data is in Hadoop. [HDFS] is the underlying storage mechanism. The results of your [MapReduce] analyses or just your data collection are immediately available to people with a SQL-based tool or SQL knowledge," he said.
"The fundamental issue with Hadoop is not its power, not its capability -- it's accessing the information that's in Hadoop. What Hawq does is it brings the entire world of SQL -- skills, knowledge, tools -- to the Hadoop community."