DataStax: Anything Hadoop Can Do Cassandra Can Do Better
Innovation by NoSQL players like DataStax challenges the BI status quo.
- By Stephen Swoyer
- August 20, 2013
Several dozen vendors exhibited at last month's O'Reilly Open Source Convention (OSCon) in Portland. Only one vendor showed up for both OSCon and the Pacific Northwest BI Summit, however.
That would be DataStax, which markets DataStax Enterprise, a NoSQL platform that bundles the Apache Cassandra distributed database, the ubiquitous Hadoop stack, and Solr, an analytic search facility based on the open source software (OSS) Lucene project.
DataStax seemed perfectly at home in OSCon's Carnival-like atmosphere, where its Hadoop-centric competitors -- Cloudera Inc., Hortonworks Inc., MapR Technologies Inc., and Pivotal -- were also in attendance, along with a slew of big data-oriented start-ups.
On the other hand, DataStax was the only big data best-of-breed represented at the 12th annual Pacific Northwest BI Summit, held during the same week in Grants Pass, Ore., 270 miles to the south. In this setting, DataStax's take on business intelligence (BI) and decision support, to say nothing of its vision for What Comes Next, contrasted markedly with those of other attendees, the bulk of whom represented established BI powers, some 14 in all.
Making the Case for Cassandra
Compared with Hadoop, you don't hear as much about Cassandra.
This seems inconceivable to Lara Shackelford, vice president of marketing with DataStax: Give Shackelford an opening and she'll tick off an exhaustive tally of all of the reasons why Cassandra makes for a better OLTP and analytic platform than Hadoop -- or any other NoSQL competitor, for that matter.
"One of the largest grocery chains in the world is working with us to try to figure out how to drive more people into their supermarkets. They want to offer an app in the App Store, but they want to be able to target customers with offers or promotions that will appeal to them. They're using us for analytics," she explains. "They use us on-premises and in the cloud. That's one of our advantages: we can distribute [the same instance of Cassandra] across both [contexts]. You can't easily do that with Hadoop."
Cassandra, she argues, has robust fault tolerance. It doesn't simply protect against data loss or corruption -- which is what the Hadoop distributed file system (HDFS) does when it copies a "block" of data (essentially, a decomposed piece of a file) in triplicate across a Hadoop cluster. It also replicates data to multiple nodes and supports replication between geographically distributed data centers. Hadoop's lack of robust fault tolerance is a known problem: solutions do exist, but they tend to be half-measures (e.g., a "warm standby" capability) or vendor-specific implementations.
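A simplified sketch can make the idea concrete. Cassandra places replicas by walking its token ring; the node names, tokens, and hashing below are purely illustrative, not DataStax's implementation:

```python
import hashlib
from bisect import bisect_right

# Illustrative token ring spanning two data centers ("dc1", "dc2").
# Real Cassandra uses 128-bit tokens and a pluggable replication
# strategy; this sketch only shows the ring-walking idea.
NODES = [(0, "dc1-n1"), (64, "dc2-n1"), (128, "dc1-n2"), (192, "dc2-n2")]

def token(key: str) -> int:
    """Hash a row key onto a 0-255 token ring (toy stand-in for a partitioner)."""
    return hashlib.md5(key.encode()).digest()[0]

def replicas(key: str, n: int = 3):
    """Walk the ring clockwise from the key's token, taking the next n nodes."""
    tokens = [t for t, _ in NODES]
    start = bisect_right(tokens, token(key)) % len(NODES)
    return [NODES[(start + i) % len(NODES)][1] for i in range(n)]

print(replicas("customer:42"))  # three distinct nodes spanning both data centers
```

Because the data centers interleave on this toy ring, any three consecutive replicas land in both of them -- the same placement intuition behind multi-data-center replication.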
At the TDWI World Conference in Las Vegas in February, for example, a representative with a prominent North American insurance company expressed his frustration with Hadoop's high-availability story. Because of his company's policies, this attendee said he couldn't deploy Hadoop in the data center. He reported getting unsatisfactory answers from all of the Hadoop vendors at the conference. "None of them [Hadoop vendors] has an answer for this," he told BI This Week.
Second, there's the Cassandra File System (CFS), which Shackelford says offers several advantages over HDFS. Depending on whom you ask, this might amount to faint praise. HDFS has no shortage of detractors, particularly among data management (DM) practitioners. (As a general-purpose system for reading and writing large files, it's good enough; as a substrate for random reads and writes of individual records, it's much less adept.) CFS uses a peer-to-peer (or "ring") architecture, as distinct from HDFS' master-slave scheme; this is key to its resilience and fault tolerance.
CFS has other advantages, too, she argues. For example, HDFS is optimized for large files; filling it up with lots of small files can hurt performance. Given four separate 11-MB files, Hadoop will spin up four separate map tasks to process them; this increases latency and squanders system resources. (Lots of small files also more quickly exhaust the Hadoop NameNode's in-memory namespace.) CFS doesn't have this limitation, Shackelford points out. In addition, she notes, Cassandra automatically handles replication and failover -- "failover" is a function of massive distribution and redundancy -- so administrators don't have to configure master-slave failover schemes.
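A back-of-the-envelope sketch of the small-files effect (the 64-MB block size is a common HDFS default of the era, used here purely for illustration):

```python
# Each HDFS file occupies at least one block, and each block typically
# becomes one map task -- so many small files mean many short-lived tasks.
BLOCK_SIZE_MB = 64  # illustrative HDFS default block size

def map_tasks(file_sizes_mb):
    """One map task per block; a file smaller than a block still costs a task."""
    return sum(max(1, -(-size // BLOCK_SIZE_MB)) for size in file_sizes_mb)

# Four separate 11-MB files vs. the same 44 MB consolidated into one file:
print(map_tasks([11, 11, 11, 11]))  # 4 tasks, mostly startup overhead
print(map_tasks([44]))              # 1 task for the same data
```

The data volume is identical in both cases; only the file layout changes, yet the small-file layout quadruples the task count.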
She cites another not-so-obvious advantage of using Cassandra: CFS implements the HDFS API, which lets it support the Hadoop DM stack. In other words, Hadoop-based tools or services will run without modification on Cassandra and CFS. This leads to an Ockham's Razor-type of problem, however: if DataStax relies to a large extent on the Hadoop stack for its analytic component -- and it does -- why not eschew it altogether and run Hadoop, HDFS, and the Hadoop constellation of projects?
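To make the compatibility claim concrete: because CFS answers to the HDFS API, pointing a Hadoop job at Cassandra is, in principle, a matter of configuration rather than code. The fragment below is a hypothetical core-site.xml; fs.default.name is Hadoop's standard (1.x-era) filesystem property, while the cfs:// scheme and host name are illustrative of DataStax's convention rather than copied from its documentation:

```xml
<!-- Hypothetical core-site.xml: swap HDFS for CFS without touching job code -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- was: hdfs://namenode-host:8020 -->
    <value>cfs://cassandra-host/</value>
  </property>
</configuration>
```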
The answer has to do with the advantages already outlined, Shackelford argues. Cassandra scales better, distributes better, is fault-tolerant, and boasts a superior file system layer. This makes it a faster, more scalable, and more resilient platform for hosting everything from the MapReduce engine to projects such as Hive (a SQL-like semantic layer for Hadoop), Mahout (a predictive analytic/machine learning facility for Hadoop), and others. (Because Cassandra is itself a column-family store with a data model similar to HBase's, the choice to use HBase comes down to user preference or application requirements.)
In addition, Shackelford says, DataStax Enterprise edition bundles Solr, a content search and indexing facility. Solr supports vector-space relevance scoring and implements algorithmic search capabilities, so it's really more of an analytic search facility. This gives DataStax a built-in analytic discovery service, Shackelford argues. (Add in OSS projects such as OpenNLP or Mahout and Solr can support natural language processing, too.)
Of course, from a traditional BI perspective, the problem with Cassandra and other NoSQL platforms is that none of them is ideal for the kinds of workloads used in BI and decision support. BI workloads consist of joins and bulk operations for which the NoSQL platforms simply weren't designed, let alone optimized. Cassandra, for example, doesn't support joins -- although joins can be parallelized using the MapReduce engine running on CFS. (Depending on the workload, however, this can require extremely complex Java or Pig Latin coding.)
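For a sense of what "hand-building" a join means, here is the shape of a reduce-side equi-join as it might be expressed over MapReduce -- a plain-Python sketch of the pattern, with made-up data, not DataStax's or Hadoop's actual API:

```python
from collections import defaultdict

# Two "tables" that a SQL engine would join with one line of SQL;
# on MapReduce the join must be hand-built as a shuffle on the join key.
orders = [("c1", "order-100"), ("c2", "order-101"), ("c1", "order-102")]
customers = [("c1", "Alice"), ("c2", "Bob")]

def reduce_side_join(left, right):
    """Map phase: tag each record with its source and emit it by join key.
    Shuffle: group by key. Reduce phase: cross the two tagged groups."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return [(key, l, r)
            for key, (lvals, rvals) in sorted(groups.items())
            for l in lvals for r in rvals]

print(reduce_side_join(orders, customers))
# [('c1', 'order-100', 'Alice'), ('c1', 'order-102', 'Alice'), ('c2', 'order-101', 'Bob')]
```

What a declarative engine hides -- tagging, shuffling, and crossing record groups -- becomes explicit program logic here, which is why hand-coded joins over MapReduce get complex quickly.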
To the extent that BI is "done" on any of these platforms, then, it requires the use of tools that (from the perspective of many BI practitioners) amount to kludges: e.g., projects such as Hive, HCatalog -- a rudimentary metadata catalog service for Hadoop -- and others. For most BI workloads, querying against Hive is going to be much slower than querying against a dedicated analytic DBMS.
The Forgotten Cloud
The other side of this coin is that the conceptual or architectural assumptions underpinning BI and decision support simply don't lend themselves to the real world -- i.e., to the ways in which information is increasingly generated, managed, analyzed, and consumed. (The data warehouse (DW), for example, is predicated on a pair of unrealistic assumptions: first, that requirements can be known and modeled in advance; second, that requirements won't significantly change.)
A more recent wrinkle is the claim that the BI usage and consumption model is out of step with the evolution of the rest of IT. BI, some detractors claim, is still grounded in a client-server architecture that's closing in on 40 years old. Almost all BI vendors "have" cloud strategies, to be sure, but most of these want to embrace the cloud by co-opting it -- i.e., by transplanting an existing model into a hosted context. They're touting a software-as-a-service (SaaS) spin on cloud -- e.g., a customer buys a subscription for a domain-specific service (CRM, reporting, or ETL) -- even as hosting providers and enterprise IT organizations are shifting to platform- or infrastructure-as-a-service (PaaS or IaaS) models.
This is Shackelford's trump card. "Because our architecture is designed to be massively distributed, it's a great solution for the cloud. One of our core things we enable is the [geographically] distributed data center," she explains. "We have one of the biggest retail companies in the world [that] runs us in about seven different locations. They came to us because they had an issue on Black Friday, and their Oracle system had let them down," Shackelford continues. "It wasn't even a question: DataStax was just so much cheaper than what it would have cost them to achieve the same [kind of availability] in Oracle."
She points to the OSCon experience, which was dominated by cloud vendors; most were marketing PaaS offerings based on OpenStack, CloudStack, or Eucalyptus. These are OSS projects that aim to deliver a feature-complete cloud stack. The allure of such a stack is the promise of portability via open APIs: i.e., the ability to provision and move cloud instances from one context to another (be it intra-provider or inter-provider) with a few clicks of a mouse-button -- or a few swipes of a touchscreen, for that matter.
Some vendors -- such as ActiveState, which markets a PaaS platform called Stackato, or Red Hat, which markets OpenShift Enterprise -- even claim to support private PaaS implementations. This makes it possible to shift running instances from an on-premises PaaS implementation to a public PaaS provider (or vice versa), or to host both simultaneously.
These solutions are still incubating; given technological limitations, people-and-process intractabilities, and service-provider in-fighting, truly portable PaaS will likely fall (far) short of this vision. The point is that IT and application development outside of BI is trying to build an architecture for next-generation application delivery, management, and consumption -- be it in an OSS context (with OpenStack or CloudStack); with OSS-friendly commercial providers (such as DreamHost, which supports OpenStack APIs; Red Hat Inc., which develops OpenShift Origin; or VMware Inc., which develops Cloud Foundry); or with Amazon Web Services (AWS), the 800-pound gorilla of IaaS, viewed with fear, trembling, and even a kind of awe by competitors and potential partners alike.
At OSCon, a CIO with a prominent gaming vendor told BI This Week that it's difficult not to optimize for AWS in architecting for the cloud. Even if a company wants to make its services available on multiple cloud platforms, the prominence and feature set of AWS make it an extremely attractive target. Interestingly, Eucalyptus from Eucalyptus Systems Inc. aims to provide an OSS infrastructure for building AWS-compatible private IaaS environments; it's even possible to move cloud instances between Eucalyptus and AWS. (Elsewhere, Apache CloudStack also supports AWS APIs; OpenShift and Cloud Foundry both run in AWS.)
There's One Big Catch, however. Just as BI and DW workloads don't easily lend themselves to processing via NoSQL, they likewise can't easily be shifted or transplanted into the massively distributed context of the cloud. More precisely, BI and DW workloads as presently constituted can't easily be shifted or transplanted into the cloud. However, DataStax and other NoSQL players -- including Cloudera, Hortonworks, and MapR, along with untraditional players such as Cloudant (which markets a distributed database based on CouchDB), Datameer Inc., Platfora Inc., and others -- are working to address this.
There's a flip side to this coin, too. If BI and DW workloads (as presently constituted) can't easily be shifted into the cloud, NoSQL data processing platforms -- as presently constituted -- aren't an ideal fit for traditional BI or DW workloads. They're better suited for specific use cases (such as data staging and data preparation), as well as for certain kinds of analytic workloads (especially those involving "multi-structured" -- viz., text, voice, video, and other kinds of not-so-structured -- data). It's possible to view these use cases as the thin end of an inevitable wedge, however: workloads will change, expectations will change, delivery and consumption models will change -- and so, too, will the capabilities of NoSQL and other emerging technologies.
Today, for example, most of the NoSQL platforms, along with offerings from vendors such as Actuate Corp. (which markets the BI Reporting Tool, or BIRT), Jaspersoft Inc., and Talend, are available via PaaS offerings from Amazon, Red Hat, VMware, ActiveState, and others. In many cases, they can be installed (if not configured) from an App Store-like Web storefront. From the perspective of traditional BI and DW, these implementations are far from ideal. As customers see it, however, they're good enough -- whether for greenfield BI deployments; for seasonal, one-off, or unexpected business or project requirements; for localized -- i.e., workgroup- or business unit-specific -- needs; for developing, testing, or prototyping BI applications and services; and so on.
According to Shackelford, they're only going to get better. "All of this [innovation] has happened so quickly. If you look at Cassandra or even Hadoop today, there's so much more to them than there was even two years ago. We're continuously innovating, the open source community is continuously innovating. Everything has come so far so fast."