TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

RESEARCH & RESOURCES

DataStax: Anything Hadoop Can Do Cassandra Can Do Better

Innovation by NoSQL players like DataStax challenges the BI status quo.

By Stephen Swoyer
August 20, 2013

Several dozen vendors exhibited at last month's O'Reilly Open Source Convention (OSCon) in Portland. Only one vendor showed up for both OSCon and the Pacific Northwest BI Summit, however.

That would be DataStax, which markets DataStax Enterprise, a NoSQL platform that bundles the Apache Cassandra distributed database, the ubiquitous Hadoop stack, and Solr, an analytic search facility based on the open source software (OSS) Lucene project.

DataStax seemed perfectly at home in OSCon's carnival-like atmosphere, where its Hadoop-centric competitors -- Cloudera Inc., Hortonworks Inc., MapR Technologies Inc., and Pivotal -- were also in attendance, along with a slew of big data-oriented start-ups.

On the other hand, DataStax was the only big data best-of-breed vendor represented at the 12th annual Pacific Northwest BI Summit, held during the same week in Grants Pass, Ore., 270 miles to the south. In this setting, DataStax's take on business intelligence (BI) and decision support, to say nothing of its vision for What Comes Next, contrasted markedly with those of other attendees, the bulk of whom represented established BI powers, some 14 in all.

Making the Case for Cassandra

Compared with Hadoop, you don't hear as much about Cassandra.

This seems inconceivable to Lara Shackelford, vice president of marketing with DataStax: Give Shackelford an opening and she'll tick off an exhaustive tally of all of the reasons why Cassandra makes for a better OLTP and analytic platform than Hadoop -- or any other NoSQL competitor, for that matter.

"One of the largest grocery chains in the world is working with us to try to figure out how to drive more people into their supermarkets. They want to offer an app in the App Store, but they want to be able to target customers with offers or promotions that will appeal to them. They're using us for analytics," she explains. "They use us on-premises and in the cloud. That's one of our advantages: we can distribute [the same instance of Cassandra] across both [contexts]. You can't easily do that with Hadoop."

Cassandra, she argues, has robust fault tolerance. It doesn't simply protect against data loss or corruption -- which is what the Hadoop distributed file system (HDFS) does when it copies a "block" of data (essentially, a decomposed piece of a file) in triplicate across a Hadoop cluster. Instead, it replicates blocks to multiple nodes and supports replication between geographically distributed nodes. Hadoop's lack of support for robust fault-tolerance is a known problem: solutions do exist, but they tend to be half-measures (e.g., a "warm standby" capability) or various vendor-specific implementations.

At the TDWI World Conference in Las Vegas in February, for example, a representative with a prominent North American insurance company expressed his frustration with Hadoop's high-availability story. Because of his company's policies, this attendee said he couldn't deploy Hadoop in the data center. He reported getting unsatisfactory answers from all of the Hadoop vendors at the conference. "None of them [Hadoop vendors] has an answer for this," he told BI This Week.

Second, there's the Cassandra File System (CFS), which Shackelford says offers several advantages over HDFS. Depending on whom you ask, this might amount to faint praise. HDFS has no shortage of detractors, particularly among data management (DM) practitioners. (As a general-purpose system for reading and writing files, it's good enough; as a file system for reading and writing data bits, it's much less adept.) CFS uses a peer-to-peer (or "ring") architecture, as distinct from HDFS's master-slave scheme; this is key to its resilience and fault-tolerance.
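
To make the "ring" idea concrete, here is a toy sketch of consistent hashing, the general technique behind peer-to-peer rings of this kind. It is not DataStax's or CFS's actual implementation: the node names, the MD5-based `token` function, and the `Ring` class are all hypothetical, and the replica count of three is an assumption chosen only to mirror the triplicate copies mentioned earlier.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (hash choice is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node owns a token; there is no master."""

    def __init__(self, nodes, replicas=3):
        self.replicas = min(replicas, len(nodes))
        # Sort nodes by their position on the ring.
        self.tokens = sorted((token(n), n) for n in nodes)

    def owners(self, key: str):
        """Return the nodes holding copies of `key`, walking clockwise."""
        start = bisect.bisect_right(self.tokens, (token(key), ""))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.replicas)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes)
placement = ring.owners("customer:42")

# Simulate losing the first owner: the other replicas still hold the data.
survivors = [n for n in nodes if n != placement[0]]
after_failure = Ring(survivors).owners("customer:42")
```

Because any node can compute `owners` for itself, no single coordinator has to be consulted or failed over, which is the property the master-slave comparison turns on.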

CFS has other advantages, too, she argues. For example, HDFS is optimized for large file sizes; filling up HDFS with lots of small files can negatively impact performance. If required, Hadoop will spin up four separate map tasks to process four separate 11 MB files; this can increase latency and squander system resources. (Lots of small files can more quickly max out the namespace index that the HDFS NameNode keeps in memory, too.) CFS doesn't have this limitation, Shackelford points out. In addition, she notes, Cassandra automatically handles replication and failover -- "failover" is a function of massive distribution and redundancy -- so administrators don't have to configure master-slave failover schemes.
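
The small-files overhead can be illustrated with some back-of-the-envelope arithmetic. The sketch below rests on simplifying assumptions not stated in the article: a 64 MB block size, at least one unit of map work per file, input splits that never span files, and one namespace entry per file plus one per block. The function names are hypothetical.

```python
import math


def blocks(file_sizes_mb, block_size_mb=64):
    """Blocks needed per file list; a file never shares a block with another."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


def map_units(file_sizes_mb, block_size_mb=64):
    """Units of map work, assuming one per block-sized split."""
    return blocks(file_sizes_mb, block_size_mb)


def namespace_entries(file_sizes_mb, block_size_mb=64):
    """Entries the namespace index must track: one per file plus one per block."""
    return len(file_sizes_mb) + blocks(file_sizes_mb, block_size_mb)


four_small = [11, 11, 11, 11]   # four separate 11 MB files
one_merged = [44]               # the same 44 MB stored as a single file

small_work = map_units(four_small)
merged_work = map_units(one_merged)
small_entries = namespace_entries(four_small)
merged_entries = namespace_entries(one_merged)
```

Under these assumptions the same 44 MB costs four units of map work and eight namespace entries as small files, against one unit and two entries as a single file.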

She cites another not-so-obvious advantage of using Cassandra: CFS implements the HDFS API, which lets it support the Hadoop DM stack. In other words, Hadoop-based tools or services will run without modification on Cassandra and CFS. This leads to an Ockham's Razor-type problem, however: if DataStax relies to a large extent on the Hadoop stack for its analytic component -- and it does -- why not eschew Cassandra altogether and run Hadoop, HDFS, and the Hadoop constellation of projects?

The answer has to do with the advantages already outlined, Shackelford argues. Cassandra scales better, distributes better, is fault-tolerant, and boasts a superior file system layer. This makes it a faster, more scalable, and more resilient platform for hosting everything from the MapReduce engine to projects such as Hive (a SQL-like semantic layer for HDFS), Mahout (a predictive analytic/machine learning facility for Hadoop), and others. (Because Cassandra is itself a column-family store, as HBase is, the choice to use HBase depends on user preference or application requirements.)

In addition, Shackelford says, DataStax Enterprise edition bundles Solr, a content search and indexing facility. Solr supports hardware vector processing and implements algorithmic search capabilities, so it's really more of an analytic search facility. This gives DataStax a built-in analytic discovery service, Shackelford argues. (Add in OSS projects such as OpenNLP or Mahout and Solr can support natural language processing, too.)

Of course, from a traditional business intelligence (BI) perspective, the problem with Cassandra and other NoSQL platforms is that none of them is ideal for the kinds of workloads used in BI and decision support. BI workloads consist of joins and bulk operations for which the NoSQL platforms simply weren't designed, let alone optimized. Cassandra, for example, doesn't support joins -- although joins can be parallelized using the MapReduce engine running on CFS. (Depending on the type of workload, however, this can require extremely complex Java/Pig Latin coding.)
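
For readers unfamiliar with how a join gets expressed as a MapReduce job, the following is a minimal sketch of a reduce-side join written in plain Python. It is a stand-in for the pattern only, not Hadoop, Pig, or DataStax code; the `customers` and `orders` data sets and all function names are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inputs: (customer_id, name) and (customer_id, item).
customers = [(1, "Ada"), (2, "Grace")]
orders = [(1, "milk"), (1, "eggs"), (2, "bread"), (3, "tea")]


def map_phase(customers, orders):
    """Emit (join_key, tagged_value) pairs so the reducer can tell sources apart."""
    for cid, name in customers:
        yield cid, ("C", name)
    for cid, item in orders:
        yield cid, ("O", item)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every customer record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "C"]
        items = [v for tag, v in groups[key] if tag == "O"]
        for name, item in product(names, items):
            yield key, name, item


joined = list(reduce_phase(shuffle(map_phase(customers, orders))))
```

Even this inner join over two tiny inputs needs tagging, grouping, and a per-key cross product, which hints at why multi-table BI queries written this way become complex quickly.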

To the extent that BI is "done" on any of these platforms, then, it requires the use of tools that (from the perspective of many BI practitioners) amount to kludges: e.g., projects such as Hive, HCatalog -- a rudimentary metadata catalog service for Hadoop -- and others. For most BI workloads, querying against Hive is going to be much slower than querying against a dedicated analytic DBMS.

The Forgotten Cloud

The reverse of this coin is that the conceptual or architectural assumptions which underpin BI and decision support simply don't lend themselves to the real world -- i.e., to the ways in which information is increasingly generated, managed, analyzed, and consumed. (The DW, for example, is predicated on a pair of unrealistic assumptions: first, that requirements can be known/modeled in advance; second, that requirements won't significantly change.)

A more recent wrinkle is the claim that the BI usage and consumption model is out of step with the evolution of the rest of IT. BI, some detractors claim, is still grounded in a client-server architecture that's closing in on 40 years old. Almost all BI vendors "have" cloud strategies, to be sure, but most of these want to embrace the cloud by co-opting it -- i.e., by transplanting an existing model into a hosted context. They're touting a software-as-a-service (SaaS) spin on cloud -- e.g., a customer buys a subscription for a domain-specific service (CRM, reporting, or ETL) -- even as hosting providers and enterprise IT organizations are shifting to platform- or infrastructure-as-a-service (PaaS or IaaS) models.

This is Shackelford's trump card. "Because our architecture is designed to be massively distributed, it's a great solution for the cloud. One of our core things we enable is the [geographically] distributed data center," she explains. "We have one of the biggest retail companies in the world [that] runs us in about seven different locations. They came to us because they had an issue on Black Friday, and their Oracle system had let them down," Shackelford continues. "It wasn't even a question: DataStax was just so much cheaper than what it would have cost them to achieve the same [kind of availability] in Oracle."

She points to the OSCon experience, which was dominated by cloud vendors; most were marketing PaaS offerings based on OpenStack, CloudStack, or Eucalyptus. These are OSS projects that aim to deliver a feature-complete cloud stack. The allure of such a stack is the promise of portability via open APIs: i.e., the ability to provision and move cloud instances from one context to another (be it intra-provider or inter-provider) with a few clicks of a mouse button -- or a few swipes of a touchscreen, for that matter.

Some vendors -- such as ActiveState, which markets a PaaS platform called Stackato, or Red Hat, which markets OpenShift Enterprise -- even claim to support private PaaS implementations. This makes it possible to shift running instances from an on-premises PaaS implementation to a public PaaS provider (or vice versa) and to simultaneously host both.

These solutions are still incubating; as a function of technological limitations, people/process intractabilities, and service provider in-fighting, it's likely that truly portable PaaS will fall (far) short of this vision. The point is that IT and application development outside of BI are trying to build an architecture for next-generation application delivery, management, and consumption -- be it in an OSS context (with OpenStack or CloudStack), with OSS-friendly commercial providers (such as DreamHost, which supports OpenStack APIs; Red Hat Inc., which develops OpenShift Origin; or VMware Inc., which develops Cloud Foundry), or with Amazon Web Services (AWS), the 800-pound gorilla of IaaS that's viewed with fear, trembling, and even a kind of awe by competitors and potential partners alike.

At OSCon, a CIO with a prominent gaming vendor told BI This Week that it's difficult not to optimize for AWS in architecting for the cloud. Even if a company wants to make its services available for multiple cloud platforms, the prominence and feature set of AWS makes it an extremely attractive target platform. Interestingly, Eucalyptus from Eucalyptus Systems Inc. aims to provide an OSS infrastructure for building AWS-compatible private PaaS environments. It's even possible to move cloud instances between Eucalyptus and AWS. (Elsewhere, Apache CloudStack also supports AWS APIs; OpenShift and Cloud Foundry both run in AWS.)

There's One Big Catch, however. Just as BI and DW workloads don't easily lend themselves to processing via NoSQL, they likewise can't easily be shifted or transplanted into the massively distributed context of the cloud. More precisely, BI and DW workloads as presently constituted can't easily be shifted or transplanted into the cloud. However, DataStax and other NoSQL players -- including Cloudera, Hortonworks, and MapR, along with untraditional players such as Cloudant (which markets a distributed database based on CouchDB), Datameer Inc., Platfora Inc., and others -- are working to address this.

There's a flip side to this coin, too. If BI and DW workloads (as presently constituted) can't easily be shifted into the cloud, NoSQL data processing platforms -- as presently constituted -- aren't an ideal fit for traditional BI or DW workloads. They're better suited for specific use cases (such as data staging and data preparation), as well as for certain kinds of analytic workloads (especially those involving "multi-structured" -- viz., text, voice, video, and other kinds of not-so-structured -- data). It's possible to view these use cases as the thin end of an inevitable wedge, however: workloads will change; expectations will change; delivery and consumption models will change -- and so, too, will the capabilities of NoSQL and other emerging technologies.

Today, for example, most of the NoSQL platforms, along with offerings from vendors such as Actuate Corp. (which markets the Business Intelligence and Reporting Tools, or BIRT), JasperSoft Inc., and Talend, are available via PaaS offerings from Amazon, Red Hat, VMware, ActiveState, and others. In many cases, they can be installed (if not configured) from an App Store-like Web storefront. From the perspective of traditional BI and DW, these implementations are far from ideal. As customers see it, however, they're good enough -- whether it's for greenfield BI deployments; for seasonal, one-off, or unexpected business or project requirements; for localized -- i.e., workgroup- or business unit-specific -- needs; for developing, testing, or prototyping BI applications and services; and so on.

According to Shackelford, they're only going to get better. "All of this [innovation] has happened so quickly. If you look at Cassandra or even Hadoop today, there's so much more to them than there was even two years ago. We're continuously innovating, the open source community is continuously innovating. Everything has come so far so fast."
