What's Essential -- And What's Not -- In Big Data Analytics
- By Stephen Swoyer
- August 18, 2010
Big Data is hot. Advanced Analytics is hot. The combination (which we’ll call Big Analytics) is blazing. For that very reason, no one seems to agree on the right way to do Big Analytics.
Some vendors -- whom we’ll call the Columnar-Haves, including vendors marketing columnar analytic databases -- claim that columnar is the structure for Big Analytics. Others, whom we’ll call the Columnar-Have-Nots, claim that traditional row-based data stores (coupled with massively parallel processing, or MPP capability) give shops considerably more flexibility than their column-based counterparts.
What everyone does agree on, however, is that the traditional data warehouse -- powered, in most cases, by a commercial, off-the-shelf (COTS) database package -- is no longer relevant. "The analytical techniques and data management structures of the past no longer work in this new era of big data," writes Wayne Eckerson, director of education and research with The Data Warehousing Institute (TDWI), in a new Big Data Analytics Checklist Report.
The report provides an essential primer for doing Big Analytics. Although Eckerson and TDWI stop short of a technology prescription, they do explore the implications of the columnar- versus row-based debate -- as well as the essential concerns that drive any Big Analytics effort.
"[M]ost data warehouses … have reached maximum storage capacity without an expensive upgrade and can't support complex ad hoc queries without wreaking havoc on performance," Eckerson writes. "In addition, the underlying data warehousing platform … isn't scalable enough to support new sources of data [e.g., either internal or external] and maintain adequate query performance."
"To avoid these limitations, companies need to create a scalable architecture that supports big data analytics from the outset and utilizes existing skills and infrastructure where possible," he continues. "To do this, many companies are implementing new, specialized analytical platforms designed to accelerate query performance when running complex functions against large volumes of data. Compared to traditional query processing systems, they are easier to install and manage, offering a better total cost of ownership."
Columnar-Haves like to point to the high number of columnar entrants (vendors such as Aster Data Systems Inc., ParAccel Inc., and Vertica Corp., among others) and the columnar-come-lately strategies of vendors such as Netezza Inc. and Oracle Corp., as well as research from prominent market watchers such as International Data Corp., which earlier this year wrote enthusiastically about the benefits and future popularity of column-based data stores.
Not surprisingly, Columnar-Have-Nots tend to take issue with this vision.
They tout their massively parallel processing (MPP) underpinnings -- a topology which they share with most columnar players -- and say that a conventional row-based architecture, coupled with MPP brawn, is more flexible than a columnar MPP topology.
Take John Thompson, U.S. CEO with Kognitio, one of the longest-lived analytic database players. Thompson claimed that the Neo-Columnar Wave appeared specifically in response to analytic workloads that were overwhelming COTS DBMSs. "My view is that columnar is a really interesting and good technology for certain applications, but … I believe that those applications are receding and becoming more and more of a minority in the trend that we see going toward Big Data [and] Always-On Data," Thompson explained in an April interview.
Columnar skeptics such as Thompson raise questions about the flexibility of a column-oriented design, especially from a data management perspective. It's a familiar tactic: Columnar-Have-Nots (such as Kognitio and Dataupia Inc.) tend to concede advantages in some (very specific) cases, but inevitably raise questions about column-orientation's suitability in "broader" or "general" DW scenarios. Vendors such as Netezza and Greenplum -- which recently introduced hybrid row/column facilities for their DBMSs -- tend to be more pragmatic on both questions.
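For readers new to the debate, the trade-off both camps are arguing over can be sketched in a few lines of Python. This is a toy illustration only, not any vendor's storage engine: the `order_id`, `region`, and `amount` fields and all function names are hypothetical. A row store keeps each record together, so adding a record is one append; a column store keeps each field together, so an aggregate touches only the field it needs.

```python
# Toy illustration of row-oriented versus column-oriented layouts.
# All table and field names here are hypothetical.

# Row store: each record is kept together.
rows = [
    (1, "east", 120.0),
    (2, "west", 75.5),
    (3, "east", 42.0),
]

# Column store: each field is kept together.
columns = {
    "order_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 75.5, 42.0],
}


def total_amount_row_store(table):
    # Every whole record is visited even though only one field is needed.
    return sum(record[2] for record in table)


def total_amount_column_store(table):
    # Only the "amount" column is read; the other fields are never touched.
    return sum(table["amount"])


def insert_row_store(table, record):
    # A new record is a single append.
    table.append(record)


def insert_column_store(table, record):
    # A new record means one append per column.
    for name, value in zip(table, record):
        table[name].append(value)


insert_row_store(rows, (4, "west", 10.0))
insert_column_store(columns, (4, "west", 10.0))
row_total = total_amount_row_store(rows)
column_total = total_amount_column_store(columns)
```

Both layouts return the same answer; what differs is how much data each has to read or write to get there, which is why vendors in either camp can point to workloads that favor their design.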
Columnar Not (Exactly) the Right Question
Eckerson and TDWI have a more pragmatic take on what might be called the "Columnar Imperative."
Far from arguing over the benefits (or drawbacks) of a column-based architecture, shops would be better advised to focus on other, potentially more important issues. Row- or column-based engines marketed by Aster Data, Dataupia, Greenplum Software Inc. (now an EMC Corp. property), Hewlett-Packard Co. (HP), InfoBright, Kognitio, Netezza, ParAccel, Sybase Inc. (now an SAP AG property), Teradata, Vertica, and other vendors (to say nothing of the specialty warehouse configurations marketed by IBM, Microsoft, and Oracle) are by definition architected for Big Analytics.
Yes, some will scale better than will others -- in specific configurations (or for specific applications) -- but scalability, at least in this context, is a particular and not a universal consideration.
A more important consideration, according to Eckerson, concerns the "available options" that are unique to the different analytic database engines. Analytic database players compete on precisely these options. Just as in the automotive world -- where features such as anti-lock brakes, automatic transmissions, or airbags have morphed from nice-to-have options to need-to-have requirements -- some features come standard with all analytic databases and some are still optional.
Analytic database vendors today compete on the basis of several options -- capabilities such as in-database analytics, support for non-traditional (typically non-SQL) query types, sophisticated workload management, and connectivity flexibility.
Every vendor has an option-laden sales pitch, of course -- but few (if any) stories are exactly the same. In-database analytics is particularly hot, according to Eckerson. All analytic database vendors say they support it (to a degree), but some -- such as Aster Data, Greenplum, and (more recently) Netezza, Teradata, and Vertica -- seem to support it "more" flexibly than others.
"[S]o-called 'in-database analytics' minimizes or eliminates data movement, improves query performance, and optimizes model accuracy by enabling analytics to run against all data at a detailed level instead of against samples or summaries," writes Eckerson, who notes that the in-database approach "is particularly useful in the 'explore' phase, when business analysts investigate data sets and prepare them for analytical processing, because now they can address all the data instead of a subset and leverage the processing power of a data center database to execute the transformations." (In-database analytics is likewise important in what Eckerson calls the "scoring" stage -- i.e., when an analyst applies a model or function to incoming records.)
"With in-database analytics, scoring can execute automatically as new records enter the database rather than in a clumsy two-step process that involves exporting new records to another server and importing and inserting the scores into the appropriate records," he explains.
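The idea of scoring records as they arrive can be sketched with Python's built-in `sqlite3` module, which stands in here for an analytic database. This is a minimal sketch under stated assumptions: the `customers` table, the `churn_score` function, and its coefficients are all invented for illustration and do not come from Eckerson's report or any vendor's product.

```python
import sqlite3


def churn_score(monthly_spend, support_calls):
    # Hypothetical scoring model with made-up coefficients,
    # clamped to the range 0.0 to 1.0.
    raw = 0.3 + 0.1 * support_calls - 0.002 * monthly_spend
    return max(0.0, min(1.0, raw))


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL statements can call it by name.
conn.create_function("churn_score", 2, churn_score)

conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " monthly_spend REAL,"
    " support_calls INTEGER,"
    " score REAL)"
)

# The trigger scores each record as it enters the table, so there is
# no separate export, score, and re-import step.
conn.execute(
    """
    CREATE TEMP TRIGGER score_on_insert AFTER INSERT ON customers
    BEGIN
        UPDATE customers
        SET score = churn_score(NEW.monthly_spend, NEW.support_calls)
        WHERE id = NEW.id;
    END
    """
)

conn.executemany(
    "INSERT INTO customers (id, monthly_spend, support_calls) VALUES (?, ?, ?)",
    [(1, 50.0, 0), (2, 20.0, 5)],
)

scores = dict(conn.execute("SELECT id, score FROM customers ORDER BY id"))
```

After the inserts, every row already carries a score, computed where the data lives. Commercial analytic databases expose their own mechanisms for this; the trigger here only illustrates the pattern.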
The twist comes by virtue of (growing) support for non-SQL analytic queries, chiefly in the form of the (increasingly ubiquitous) MapReduce programming model. Aster Data and Greenplum have supported in-database MapReduce for two years; more recently, both Netezza and Teradata, along with IBM, have announced MapReduce moves. Last month, open source software (OSS) data integration (DI) player Talend announced support for Hadoop (an OSS implementation of MapReduce) in its enterprise DI product. Talend's MapReduce implementation can theoretically support in-database crunching in conjunction with Hadoop-compliant databases.
Although support for non-SQL analytics is today a nice-to-have option, it could soon become a need-to-have standard feature, according to Eckerson.
"Many analytic computations are recursive in nature, which requires multiple passes through the database. Such computations are difficult to write in SQL and expensive to run in a database management system," he points out. "[T]oday most analysts first run SQL queries to create a data set, which they download to another platform, and then run a procedural program written in Java, C, or some other language against the data set. Next, they often load the results of their analysis back into the original database."
This export-and-reload approach makes about as much sense for non-SQL queries as it does (to recap) for SQL-based analytics. The solution, once again, is optional in-database support for non-SQL analytics, Eckerson explains.
"[T]echniques like MapReduce make it possible for business analysts, rather than IT professionals, to custom-code database functions that run in a parallel environment," he writes. As implemented by Aster Data and Greenplum, for example, in-database MapReduce permits analysts or developers to write reusable functions in many languages (including the Big Five of Python, Java, C, C++, and Perl) and invoke them by means of SQL calls.
Such flexibility is a harbinger of things to come, according to Eckerson. "[A]s analytical tasks increase in complexity, developers will need to apply the appropriate tool for each task," he notes. "No longer will SQL be the only hammer in a developer's arsenal. With embedded functions, new analytical databases will accelerate the development and deployment of complex analytics against big data."
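To make the MapReduce idea concrete, here is a minimal sketch in plain Python of the pattern such embedded functions follow: an analyst writes a small map function and a small reduce function, and a generic runner applies them across data partitions. It is not the Aster Data or Greenplum API, and it omits the SQL invocation layer those products add; `run_map_reduce`, `map_clicks`, `reduce_count`, and the sample click data are all hypothetical. Threads stand in for the parallel nodes of an MPP system.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_clicks(record):
    # Analyst-written map step: emit (page, 1) for each click record.
    _user, page = record
    yield page, 1


def reduce_count(key, values):
    # Analyst-written reduce step: total the values emitted for one key.
    return key, sum(values)


def run_map_reduce(partitions, mapper, reducer):
    # Hypothetical runner: map each partition in parallel, group the
    # emitted pairs by key, then reduce each group.
    def map_partition(rows):
        return [pair for row in rows for pair in mapper(row)]

    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_partition, partitions))

    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    return dict(reducer(key, values) for key, values in groups.items())


# Two partitions standing in for data spread across two nodes.
partitions = [
    [("ann", "/home"), ("bob", "/pricing")],
    [("cat", "/home"), ("ann", "/pricing"), ("dan", "/home")],
]

page_views = run_map_reduce(partitions, map_clicks, reduce_count)
```

The analyst supplies only the two small functions; partitioning, parallel execution, and grouping belong to the runner, which in a real analytic database would be the engine itself.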
Integration and Interoperability Still Matter
Lastly, Eckerson urges, don't forget mission-critical amenities -- issues such as integration, interoperability, reliability, availability, and security. Although analytic databases were first pitched as "appliances" -- as plug-in or turnkey complements (or, in some contexts, as rip-and-replace alternatives) to an existing DM infrastructure -- in practice, such offerings are rarely, if ever, non-disruptive. This is one reason some vendors (Teradata and, more recently, HP) emphasize what they claim are best-in-class workload management features.
Their analytic databases better integrate with a shop's existing DM infrastructure, both vendors like to claim, and they boast both the scalability and the flexibility to support a wide range of users, applications, and queries. More recently, other players -- such as Aster Data, Kognitio, Netezza, and Vertica -- have hyped their own workload management efforts.
Moreover, most players like to tout the resiliency and built-in fault tolerance of the MPP architecture -- although (in a familiar move) some claim to be more fault tolerant (or more resilient) than others.
These and other issues are assuming greater salience, according to Eckerson.
"[I]nvestigate whether the analytic database integrates with existing tools in your environment, such as ETL, scheduling, and BI tools. If you plan to use it as an enterprise data warehouse replacement, find out how well it supports mixed workloads, including tactical queries, strategic queries, and inserts, updates, and deletes," Eckerson concludes. "Also, find out whether the system meets your data center standards for encryption, security, monitoring, backup/restore, and disaster recovery. Most important, you want to know whether or to what degree you will need to rewrite any existing applications to run on the new system."