Which Big Data Platform Is Right For You?
Big data platforms represent a tidal shift in the way organizations capture, store, and process data. Their success depends on the number and variety of applications they support.
- By Wayne Eckerson
- February 5, 2016
What is a Big Data Platform?
Few changes in the history of data management have been as sweeping or as disruptive as the phenomenon known as big data. The term generally refers to the avalanche of data now available to help organizations shape strategy, optimize operations, and identify customer tendencies and proclivities. Big data is the fuel that powers decisions and optimizes outcomes in new digital organizations.
Unfortunately, prior generations of data management technology aren’t scalable or flexible enough to handle big data. In response, data management vendors have introduced a new class of products that meet the scalability, performance, availability, and security requirements of big-data-driven organizations.
Called big data platforms, these products represent a tidal shift in the way organizations capture, store, and process data. Designed to process high volumes of multi-structured data in batch or real time, they consist of an ecosystem of components ranging from databases and file systems to various data processing and analytic engines to management, governance, and administrative tools.
Many big data platforms combine relational and non-relational technology, while some rely primarily on Hadoop and NoSQL. Some run exclusively within corporate data centers and others run only in the cloud. Most require in-house technical experts to install, configure, tune, and manage them, although an emerging class of cloud service providers installs and operates big data platforms on behalf of subscribing customers.
Platform for Data-Driven Applications
The success of a big data platform depends on the number and variety of applications it supports.
For example, data engineers use a big data platform to parse, clean, transform, aggregate, and prepare data for analysis. Business users use it to run SQL and NoSQL queries against the platform. Data scientists use it to discover patterns and relationships in large data sets using machine-learning algorithms. Organizations build custom applications on big data platforms to calculate customer loyalty, identify next-best offers, spot process bottlenecks, predict machine failures, monitor the health of core infrastructure, and so on.
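The data engineering steps named above (parse, clean, transform, aggregate) can be illustrated with a minimal Python sketch. It is not tied to any particular big data platform: the input format (`customer_id,amount` lines), the sample data, and all function names are assumptions made for illustration, and a real platform would run equivalent stages at far larger scale.

```python
from collections import defaultdict

# Hypothetical raw input: one "customer_id,amount" record per line.
RAW_LINES = [
    "c1, 20.50",
    "c2,5.00",
    "c1,4.50",
    "bad line",   # malformed: no comma
    "c2, ",       # malformed: missing amount
]


def parse(line):
    """Split a raw line into (customer, amount_text); None if malformed."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    return parts[0].strip(), parts[1].strip()


def clean_and_transform(records):
    """Drop bad records and convert the amount text to a float."""
    for record in records:
        if record is None:
            continue
        customer, amount_text = record
        try:
            yield customer, float(amount_text)
        except ValueError:
            # Missing or non-numeric amount: skip the record.
            continue


def aggregate(records):
    """Sum amounts per customer."""
    totals = defaultdict(float)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)


def prepare(lines):
    """Run the stages in order: parse, clean/transform, aggregate."""
    return aggregate(clean_and_transform(parse(line) for line in lines))


totals = prepare(RAW_LINES)
print(totals)
```

Keeping each stage as a separate function mirrors how such pipelines are usually composed, so that one stage can be replaced or scaled out without touching the others.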
Classifying Big Data Platforms
There are about a dozen big data platform products on the market today. They can be divided into three categories based on their heritage technology: relational databases, Hadoop distributions, and cloud managed services (see Figure 1). These categories provide a shorthand for understanding each vendor’s big data platform and strategy for achieving critical mass in the big data marketplace. Each category also appeals to different types of customers, sometimes within the same organization.
Figure 1. Classes of Big Data Platforms
Source: Selecting Big Data Platforms: Building a Foundation for the Future, Eckerson Group, 30 pages, 2015.
Relational databases: This category includes established relational database vendors (such as Actian, SAP, Teradata, Oracle, Microsoft, and HP) as well as upstarts (such as Pivotal). Their relational databases have powered many generations of data warehouses, data marts, operational data stores, and analytic applications. These systems use structured query language (SQL) to process and manipulate structured data stored in relational tables linked together using primary and foreign keys.
Hadoop distributions: Big data platforms based on Hadoop are market newcomers that have appeared within the past several years. The primary vendors in this space (MapR, Hortonworks, and Cloudera) run Hadoop as their core data processing platforms, which they supplement with a blizzard of open source software and, in some cases, proprietary software.
Cloud managed services: This category includes pure-play cloud service providers that manage and operate big data platforms on behalf of subscribers in the cloud. More than a platform-as-a-service, a cloud managed service lets customers focus solely on analyzing data and building data-driven applications rather than data infrastructure. In addition, cloud managed services provide a quick and easy way for customers without information technology experts or available servers to try out or deploy a big data platform. Leading cloud managed service providers include Altiscale, Qubole, Treasure Data, Cazena, and Amazon Web Services (AWS).
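To make the relational category above more concrete, here is a minimal sketch of SQL operating on two tables linked by keys. It uses Python's built-in `sqlite3` module purely for convenience; the table names, columns, and sample rows are hypothetical and do not represent any vendor's product.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- Key column that references the primary key of customers.
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Join the two tables on the shared key and total the orders per customer.
rows = conn.execute("""
SELECT c.name, SUM(o.amount)
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

conn.close()
print(rows)
```

The join condition is where the key relationship does its work: each order row is matched to the one customer row whose primary key it references.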
Clarifying Market Strategies
The vendors in each of these three product categories have adopted different strategies to gain traction in a fast-moving market.
Connect, integrate, and imitate: Relational database vendors see big data as both a competitive opportunity and a threat. On the one hand, big data opens up new data sources and applications for their products. On the other hand, big data brings a host of new competitors offering innovative technologies and approaches at rock-bottom (i.e., open source) prices.
To survive and thrive, relational vendors have adopted a host of strategies, ranging from partnering with Hadoop distribution vendors and re-selling or white-labeling their technology to developing technology that connects to, virtualizes, or ports relational technology to Hadoop. Some have gone further, creating their own Hadoop distributions (e.g., IBM) or open sourcing their relational database (e.g., Pivotal).
Open source and then some: For their part, Hadoop distribution vendors use some variation of the open source model to compete against relational vendors and each other. For instance, all use Apache Hadoop, but MapR and Cloudera have added proprietary extensions: MapR underneath the Hadoop and HBase APIs and Cloudera with proprietary add-on products. Hortonworks, the only major Hadoop distributor committed to a fully open-source strategy based on Apache Hadoop, has formed a consortium of other like-minded vendors (the Open Data Platform Initiative) to isolate MapR and Cloudera.
The outsourced cloud: Finally, although most relational databases and Hadoop distributions now run in the cloud to some degree and some vendors offer their own public clouds, Altiscale, Treasure Data, Qubole, Cazena, and others can be considered true cloud managed services. Altiscale offers an Apache Hadoop-based platform and runs its own public cloud. Treasure Data uses components of Hadoop but not the Hadoop Distributed File System. Qubole persists data in both Hadoop and non-Hadoop systems. Cazena provides only single-tenant processing to boost security.
For more information about big data platforms, you can download the 30-page report, Selecting Big Data Platforms: Building a Foundation for the Future, from the Eckerson Group website. The report describes the history, benefits, and challenges of big data platforms; identifies three classes of products, mapping them to user requirements; and provides a set of criteria to help readers conduct a detailed evaluation analysis of products.
Wayne Eckerson has been a thought leader in the business intelligence and analytics field since the early 1990s. He is a sought-after consultant and noted speaker who thinks critically, writes clearly, and presents persuasively about complex topics.
Eckerson has conducted many groundbreaking research studies, chaired numerous conferences, and written two widely read books: The Secrets of Analytical Leaders: Insights from Information Insiders (2012) and Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005/2010). He is currently working on a book about data governance.
Eckerson is principal consultant of Eckerson Group, LLC, a business-technology consulting firm that helps business leaders use data and technology to drive better insights and actions. His team of researchers and experienced consultants provide cutting-edge information and advice on business intelligence, analytics, performance management, data governance, data warehousing, and big data. They work closely with organizations that want to assess their current capabilities and develop a strategy that optimizes their investments in business intelligence and analytics.
For many years, Eckerson served as director of education and research at TDWI, where he oversaw the company’s research and training programs and chaired its BI Executive Summit. He has also served as an industry analyst at the Patricia Seybold Group and TechTarget. He has a bachelor’s degree from Williams College and a master’s degree from Wesleyan University.