A Closer Look: Splice Machine's ACID-Compliant Database for Hadoop
Splice Machine exposes a full ANSI SQL query facility for HBase, along with support for indices, database triggers, and other data management essentials. It's ACID-compliant, too.
- By Stephen Swoyer
- January 28, 2014
By itself, Hadoop makes for a less-than-ideal database platform. Silicon Valley start-up Splice Machine Inc. aims to change that.
It markets what it says is an ANSI-standard, ACID-compliant database for Hadoop. Splice Machine claims to expose a full ANSI SQL query facility for HBase (Hadoop's column-oriented database store), along with support for primary and secondary indices, database triggers, and other data management (DM) essentials. Oh yes: it claims to support transactional consistency, too.
Vanilla Hadoop -- i.e., the Hadoop framework, HBase, Hive (a SQL-like interpreter for Hadoop MapReduce), and the Hadoop Distributed File System (HDFS) -- doesn't support any of this.
"We are the only one who put a transactional semantics on a SQL engine on Hadoop," Splice Machine chairman and CEO Monte Zweben told BI This Week at this year's Strata + Hadoop World conference. "A lot of people think about this in terms of debits and credits and orders, but this is crucial for analytics as well. Anyone doing an online analytical app, like a dashboard, or something where you're trying to look at the real-time analytics of some domain, they have indexes in their database because they need to look at data from multiple dimensions. We support secondary indices ... we can update data and index in one transaction consistently, so when you enter a new order, indexed by ZIP code as well as by product ID and transaction ID, you can update simultaneously."
Splice Machine is one of several upstart entrants in what might be called the Next-Gen Distributed Database Stakes. Other entrants include NuoDB Inc., which developed its distributed DBMS platform from scratch; FoundationDB; and Google Inc.'s F1, which "debuted" just this year. (F1 is by no means new: it's based on Google's Spanner distributed database project. What's more, Google now uses F1 to power its AdWords service. As a combined NoSQL + RDBMS platform that's ACID-compliant -- with support for two-phase commits -- and ANSI SQL-ready, F1 seems especially intriguing.)
At this year's Strata + Hadoop World show, several other players (such as Hadapt and GridGain Systems Inc.) also touted distributed DBMS- or RDBMS-like offerings. Meanwhile, the Hadoop community is working feverishly to flesh out Hadoop's DM feature set, which involves bringing SQL querying to HBase, making Hive fully SQL-compliant (and implementing support for transactional consistency, database subqueries, and so on), and beefing up HCatalog, the metadata services catalog for Hadoop. Several projects -- such as "Stinger" (an effort helmed by Hortonworks Inc. to beef up Hive); "Impala" (a project supported by Cloudera Inc. that supports interactive or ad hoc queries on Hadoop); EMC/Pivotal's "Hawq" (a port of the Greenplum database to run on top of Hadoop); and the full-fledged decision-support-on-Hadoop platforms marketed by Datameer Inc. and Platfora Inc. -- aim to recast Hadoop as a platform for traditional decision support workloads. (There's also crucial ongoing work in other Hadoop-related projects, such as Cascading.)
Splice Machine, Zweben claims, achieves just this today. "Here [at Strata], you hear a lot about SQL-on-Hadoop. The majority -- or really all -- of those solutions are targeted to the data scientist, which is one possible constituency, but there's another, even more important constituency. The person who needs to do ad hoc queries, for example: that's an important constituency that we serve, too."
Vanilla Hadoop is a strictly sequential, batch-oriented platform, Zweben points out. Splice Machine is not. "What this means is that somebody can interact with an app on our database and change records on the database in real time, not just [perform] big batch loads in real time," he continues. "They can change Monte's record, delete Monte's record, be able to change Monte's record and all of his line items on that order consistently in one transaction -- or roll back the whole thing."
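The all-or-nothing behavior Zweben describes can be sketched with any transactional SQL engine. The example below uses Python's built-in sqlite3 module purely as a stand-in; it is not Splice Machine, and the orders and line_items tables, their columns, and the update_order helper are hypothetical names invented for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a transactional
# SQL engine; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE line_items (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        sku TEXT NOT NULL,
        qty INTEGER NOT NULL CHECK (qty > 0)
    );
    INSERT INTO orders VALUES (1, 'Monte');
    INSERT INTO line_items VALUES (1, 'widget', 2);
    """
)
conn.commit()


def update_order(conn, order_id, customer, items):
    """Replace an order's customer and line items in one transaction.

    Returns True if everything committed, False if the whole change
    was rolled back because a constraint was violated.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE orders SET customer = ? WHERE order_id = ?",
                (customer, order_id),
            )
            conn.execute("DELETE FROM line_items WHERE order_id = ?", (order_id,))
            conn.executemany(
                "INSERT INTO line_items VALUES (?, ?, ?)",
                [(order_id, sku, qty) for sku, qty in items],
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

If any line item fails the quantity check, the customer update and the earlier line-item changes are undone together, so a reader never sees a half-changed order.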
Transactionally, Splice Machine implements a technique called "snapshot isolation," which is used in Oracle (where it's called, somewhat confusingly, "serializable mode"), SQL Server, and PostgreSQL, among other DBMSes. "Every write happens in the database as a time stamp and [Splice Machine] never locks any of the records that are being read -- the writes don't become visible until the transaction commits. This makes it so that the database can be very fast and concurrent," Zweben explains.
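To make the idea concrete, here is a toy multi-version store in Python. It is a minimal sketch of snapshot isolation in general, not Splice Machine's implementation: the SnapshotStore and Transaction classes, the logical clock, and the first-committer-wins conflict rule are all assumptions chosen for illustration.

```python
import itertools


class SnapshotStore:
    """Toy multi-version key-value store (illustrative only)."""

    def __init__(self):
        self._clock = itertools.count(1)  # logical timestamps
        self._versions = {}               # key -> [(commit_ts, value), ...]

    def begin(self):
        return Transaction(self, start_ts=next(self._clock))


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}                  # private until commit

    def read(self, key):
        # Reads take no locks: they see this transaction's own writes,
        # otherwise the newest version committed before the snapshot.
        if key in self.writes:
            return self.writes[key]
        versions = self.store._versions.get(key, [])
        visible = [value for ts, value in versions if ts <= self.start_ts]
        return visible[-1] if visible else None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Assumed rule (first committer wins): abort if another transaction
        # committed a newer version of any key this transaction wrote.
        for key in self.writes:
            versions = self.store._versions.get(key, [])
            if versions and versions[-1][0] > self.start_ts:
                raise RuntimeError(f"write-write conflict on {key!r}")
        commit_ts = next(self.store._clock)
        for key, value in self.writes.items():
            self.store._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


store = SnapshotStore()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

reader = store.begin()
writer = store.begin()
writer.write("balance", 50)
before = reader.read("balance")  # uncommitted write is invisible
writer.commit()
after = reader.read("balance")   # reader keeps its original snapshot
latest = store.begin().read("balance")
```

In this sketch the reader sees 100 both before and after the writer commits, while a transaction that starts later sees 50. Readers are never blocked because each one works from the versions that existed when it began.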
This works well enough for simple transactions (such as credit or debit), but what about more complicated transactions -- such as (for example) those involving simultaneous debits and credits? In this case, Splice Machine -- like any other distributed database -- is susceptible to skewing. Vendors and IT organizations usually try to correct for skew by tightly controlling the physical and/or virtual environments in which a DBMS is deployed (e.g., by isolating nodes on discrete physical systems, as opposed to virtualizing several "nodes" on a single system; by using direct-attached as opposed to network, shared, or pooled storage; by standardizing cluster hardware to ensure consistent performance from node to node; by using a high-bandwidth, low-latency transport backbone, such as InfiniBand). This is easier to do in traditional enterprise database or massively parallel processing (MPP) deployments -- i.e., those involving single-system nodes -- than in highly virtualized schemes.
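If the skew at issue is the write-skew anomaly usually associated with snapshot isolation, a small Python model shows how it can arise. This is a sketch under stated assumptions, not a description of Splice Machine's behavior: the Store and Txn classes, the two account names, and the rule that the combined balance must stay non-negative are all hypothetical.

```python
class Store:
    """Toy store holding the latest committed value and version per key."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}


class Txn:
    """Toy snapshot-isolated transaction (copies the store at start)."""

    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)  # frozen view taken at start
        self.seen = dict(store.version)   # versions observed at start
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Only write-write conflicts are checked, as in plain snapshot
        # isolation; stale reads of other keys go unnoticed.
        for key in self.writes:
            if self.store.version[key] != self.seen[key]:
                raise RuntimeError(f"write-write conflict on {key!r}")
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] += 1


def withdraw(txn, account, amount):
    # Hypothetical business rule: checking + savings must never go negative.
    total = txn.read("checking") + txn.read("savings")
    if total - amount < 0:
        raise ValueError("insufficient combined funds")
    txn.write(account, txn.read(account) - amount)


store = Store({"checking": 50, "savings": 50})
t1, t2 = Txn(store), Txn(store)
withdraw(t1, "checking", 80)  # sees a combined balance of 100, so allowed
withdraw(t2, "savings", 80)   # sees the same snapshot, so also allowed
t1.commit()
t2.commit()                   # no conflict: the write sets are disjoint
total = store.data["checking"] + store.data["savings"]  # -60
```

Each transaction is valid against its own snapshot, and neither overwrites the other's key, so both commit and the combined balance ends up at -60. Serializable isolation would reject one of them, typically at the cost of locking or extra conflict tracking.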
"We would be victim to the same issues of skew [as would any other distributed database], where you would kind of have to hold a record and update it," he concedes.
Zweben contrasts Splice Machine's model with that of NuoDB, which effectively built its architecture from scratch. "They [NuoDB] do have a similar snapshot isolation approach. We can handle a transactional context, but we would be victim to a serializable isolation just like they would."
"They're a nice architecture, but I'd have to say that the difference is [that] we decided to start on a proven scaled architecture which is Hadoop, rather than building our own," he points out, stressing, however, that "many of our customers are not looking for Hadoop solutions; they're looking for scale-out solutions for their database ... and they don't really care that it's on Hadoop. All that they care about is that it can scale beyond terabytes and distribute data automatically across the cluster."