A Platform for All Data, Big and Small
In an era of ever-expanding data volumes and heterogeneous data-processing requirements, Vertica really shines, HP officials maintain.
- By Stephen Swoyer
- November 3, 2015
The recent Strata + Hadoop World Conference was packed with a panoply of players, many of them start-ups or upstarts. In this context, established powers such as Hewlett-Packard Co. (HP), Informatica Corp., Microsoft Corp., Oracle Corp., SAS Institute Inc., and Teradata Corp. might have seemed like the odd vendors out.
The Strata + Hadoop World tent isn't just a big one; it's an ever-expanding one, too. On the massive expo hall floor, these and other established vendors -- systems management specialist BMC Software Inc. was there, too, as were Cisco Systems Inc. and Dell Inc. -- seemed no stiffer or squarer than, say, Cloudera Inc., Hortonworks Inc., and MapR Technologies Inc., all of which have been in business for a few years. In other words, all of these vendors seemed a little stiff, at least in the midst of that scrappy horde of upstarts, start-ups, and would-be disrupters.
Stiff, maybe, but certainly not fazed. Take HP, which was on hand to trumpet its array of big data-oriented products and services. One of HP's core big data offerings is its Vertica massively parallel processing (MPP) columnar database, which turned 10 this year. (Database design maestro and start-up maven Michael Stonebraker founded Vertica in 2005; HP purchased it almost five years ago, early in 2011.) Coincidentally, the Hadoop platform itself turned 10 in 2015. Vertica, like Hadoop, can scale to support petabytes of data. Vertica, like Hadoop, can ingest, store, and process semi-structured data, such as text.
What's more, Vertica can query directly against data stored in several of Hadoop's columnar storage formats, viz., Parquet and Optimized Row Columnar (ORC) files. To be sure, Hadoop and the Apache Spark cluster computing framework can do things Vertica can't, such as ingest, store, and process non-relational, multi-structured data (file objects of any kind). Vertica, in turn, can do something Hadoop can't, argued Steve Sarsfield, product marketing manager with HP Vertica, earlier this year: it's a query-processing platform par excellence.
This has a lot to do with its MPP database underpinnings as well as its design and optimization for decision support and data warehousing workloads.
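To make that concrete, the sketch below shows what querying ORC or Parquet files sitting in HDFS from Vertica might look like in practice. It is purely illustrative: it assumes the open source vertica-python client, and the host, credentials, table, HDFS path, and external-table DDL (whose exact syntax varies by Vertica version) are hypothetical placeholders rather than details drawn from HP's documentation.

```python
# Illustrative sketch: defining and querying a Vertica external table over ORC
# files in HDFS. Assumes the open source vertica-python client; the host,
# credentials, schema, HDFS path, and exact DDL (which varies across Vertica
# versions) are placeholders, not details from HP's documentation.
import vertica_python

conn_info = {
    "host": "vertica.example.com",  # placeholder cluster address
    "port": 5433,
    "user": "dbadmin",
    "password": "********",
    "database": "analytics",
}

CREATE_EXTERNAL = """
CREATE EXTERNAL TABLE web_clicks (
    user_id INT,
    url     VARCHAR(2048),
    ts      TIMESTAMP
) AS COPY FROM 'hdfs:///data/clicks/*.orc' ORC
"""

TOP_URLS = """
SELECT url, COUNT(*) AS hits
FROM web_clicks
WHERE ts >= '2015-10-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10
"""

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    cur.execute(CREATE_EXTERNAL)  # define the table once; the data stays in HDFS
    cur.execute(TOP_URLS)         # Vertica reads the ORC files at query time
    for url, hits in cur.fetchall():
        print(url, hits)
finally:
    conn.close()
```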
"Vertica is a relational database, and, like any relational database, it has columns and tables, it speaks SQL, and it connects to other data sources via things like ODBC. It connects and talks to business intelligence tools, in addition to self-service tools. Tableau, Looker, Qlik, all of that stuff works very well with Vertica," Sarsfield said.
"For some people, this is unsexy stuff, but Vertica can query against data in Hadoop, so if you want to add Hadoop nodes and make use of them, that's part of the equation. You can easily do that, and Vertica has built-in capabilities [i.e., functions and algorithms] for analyzing text and unstructured data. Most important, Vertica is an MPP database, so it was designed for processing data at massive scale."
According to Sarsfield, Vertica also has an advantage when it comes to storing data at big-data scale -- or, at least, Very Large Volumes of data. "We developed our own data compression algorithms, so we have excellent data compression and we're able to achieve excellent efficiency. Vertica can analyze the data [it's ingesting] and decide which algorithm offers the best compression [for storing it]," he explained.
In most cases, compression entails a performance trade-off of some kind, but a columnar architecture can help mitigate this trade-off to some extent, Sarsfield argued. "Vertica also supports what is called 'late materialization,' so it can actually perform [in-memory] operations on compressed data without uncompressing it. This can significantly increase performance," he argued.
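To see why operating on compressed columnar data can pay off, consider the toy sketch below, which run-length encodes a sorted, low-cardinality column and answers a filter-and-count query by scanning runs rather than rows. It is a conceptual illustration only -- not Vertica's actual encodings, late-materialization logic, or execution engine.

```python
# Toy illustration of operating on compressed columnar data: run-length encode a
# sorted column, then answer a filter-and-count query by scanning runs instead of
# rows. A conceptual sketch only -- not Vertica's actual encodings or executor.
from itertools import groupby

def rle_encode(column):
    """Collapse a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def count_equal(encoded, target):
    """Count rows equal to target without expanding the runs."""
    return sum(length for value, length in encoded if value == target)

# A sorted, low-cardinality column compresses extremely well under RLE.
status_column = ["ok"] * 9000 + ["retry"] * 900 + ["error"] * 100

encoded = rle_encode(status_column)
print(encoded)                        # [('ok', 9000), ('retry', 900), ('error', 100)]
print(count_equal(encoded, "error"))  # 100 -- derived from 3 runs, not 10,000 rows
```

The engine-level idea is the same: when an encoding preserves enough structure, predicates and aggregates can be evaluated against far fewer physical units than there are logical rows.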
Super-charging the Data Warehouse -- And NoSQL Analytics
Sarsfield described Vertica as the equivalent of a "super-charger" for aging data warehouse systems. In this respect, he noted, reports of the data warehouse's death have been greatly exaggerated. Indeed, the Strata + Hadoop World expo hall all but teemed with would-be data warehouse replacements, most of which are designed to run on Hadoop or Apache Spark. (Examples include relative newcomer AtScale Inc., along with established players such as Datameer Inc. and Platfora Inc. AtScale explicitly markets itself as an OLAP technology for Hadoop/Hive. Research analyst Mark Madsen, a principal with Third Nature Inc., once dubbed Platfora "PlatfOrLAP" for similar reasons.)
What's more, vendors such as Cloudera, Databricks Inc. (the commercial entity behind Spark), Hortonworks, and MapR, along with IBM Corp. and Teradata, have invested significantly in shoring up Hadoop's ANSI SQL bona fides: Cloudera via its investments in Impala, an interactive SQL query engine for Hadoop; Hortonworks via its work with Hive (a SQL interpreter for Hadoop) and Tez (a replacement for Hadoop's MapReduce engine that supports interactive processing); MapR via its work with Drill, the open source implementation of Google Inc.'s Dremel distributed query technology; Databricks and IBM via their investments in Spark (which has its own SQL variant, Spark SQL); and Teradata via its investment in Presto, a SQL query engine for Hadoop. If the data warehouse as an institution is dying, data warehouse architecture -- as a conceptual framework -- is alive and well.
Optimized platforms such as Vertica could be considered "better-than-data-warehouse" data warehouse systems, Sarsfield argued. In point of fact, there are several extant optimized MPP database platforms. These include Actian's Matrix, EMC's Greenplum, IBM's Netezza, Microsoft's SQL Server Parallel Data Warehouse, and Teradata's Aster Appliance. All of these systems address traditional ad hoc query and analysis requirements; interoperate with both traditional and newer self-service BI tools; can scale to big-data volumes; and can store, process, and analyze non-traditional data formats, including non-relational multi-structured data.
Most can also access and query against data stored in Hadoop.
HP and other vendors aren't just marketing these platforms for SQL-based analytics but for NoSQL analytics, too. Teradata is especially vociferous in this regard. Even though it resells a Hadoop appliance, Teradata insists that either its Aster Discovery platform or its Teradata database -- or both -- can cost-effectively perform most if not all of the same non-relational analytical workloads as Hadoop and/or Spark.
Sarsfield acknowledged that some customers think it's neither cost-effective nor practicable to run NoSQL analytics in a nominally SQL platform.
HP is working to convince them otherwise, he said. "In Hadoop, you don't have all of the analytical functions and algorithms you have in Vertica. Hadoop is going to be slower [than Vertica] for most of these [non-relational workloads], too. On top of this, you won't have nearly the same query[-processing] performance in Hadoop that you get in Vertica. You won't have the same governance, the same security, [or] the same service levels. Hadoop can't support high concurrency. Hadoop has poor [support for] metadata [management] and [data] lineage," Sarsfield maintained.
"Hadoop is a cost-effective alternative [to the data warehouse] for [storing] certain kinds of data and [processing] certain kinds of workloads," he continued
Sarsfield argued, however, that Vertica's ability to access, ingest, and/or query against data in Hadoop permits an organization to deploy a cost-effective hybrid platform. "It's using the platform that's best for your needs. For ad hoc query, for decision support, BI tool access, self-service access, [an MPP database platform such as] Vertica is almost always going to be the best solution. For time-series and other kinds of advanced analytics, it's going to be better," he noted.