Is Big Data Tipping toward Pragmatism?
The theme of this month's TDWI World Conference in Chicago was "The Big Data Tipping Point." That tipping point may be occurring in some not-so-obvious ways.
- By Stephen Swoyer
- May 21, 2013
This month's TDWI World Conference in Chicago, billed as "The Big Data Tipping Point," opened with a keynote address by Ken Rudin, head of analytics with Facebook, an archetypal big data company. Facebook developed Hive, the SQL-like semantic layer for Hadoop, which it used to power its Hadoop-based data warehouse (DW) environment.
In his presentation, Rudin staked out a pragmatic "both/and" position -- championing both Hadoop and RDBMS technologies -- in place of "either/or."
"[Facebook] started in the Hadoop world. We are now bringing in relational to enhance that. We're kind of going [in] the other direction," Rudin told attendees. "We've been there, and [we] realized that using the wrong technology for certain kinds of problems can be difficult. We started at the end and we're working our way backwards, bringing in both."
Rudin explained that "traditional systems like relational are really good at ... [answering] the traditional business questions that we all still ask and will ask, and that's not going away just because the new technologies are there."
Thus one "tipping point" -- perhaps you could even call it a correction to a kind of misguided big data exuberance. "Doing" big data doesn't mean wiping the slate clean; it doesn't mean throwing out the rulebook, abandoning best practices, or ripping-and-replacing existing technologies.
In fact, Philip Russom, director of data management (DM) with TDWI Research, has argued that "retooling" for big data can be as simple as making "adjustments" to how a DW is structured and managed.
In April, Russom told BI This Week that "enterprise data warehouses are not going away. There is this tendency to say 'Oh, big data is this new thing; it has all of these new requirements; therefore, I must throw away my old things.' Enterprise data warehouses [are] not going away, [because] they're still killer platforms for the things they're designed for, [such as] standard reports."
Return to Normalcy?
It may be that the "Big Data Tipping Point" involved a move toward pragmatism.
As if in response to Facebook's come-to-relational revelation, several of the exhibitors at the TDWI World Conference in Chicago all but said: We told you so.
"The home base for executive decision-making, for decision-making at every level [in the enterprise], is the traditional data warehouse. Big data doesn't change that," said Mark Budzinski, vice president and general manager with agile DW specialist WhereScape Inc.
Budzinski and WhereScape have been consistent critics of big data clean-slate-ism.
CEO Michael Whitehead, for example, consistently sticks to his data warehousing guns -- even when challenged on the putative "necessity" of architectural change.
"When you look at some of the problems that the data warehouse is designed to solve, it's just a natural way to answer a set of questions," he told BI This Week in a March interview. "You just always end up at the same point: [i.e. that] it's natural to have a repository of data that's materialized and to keep it there for a certain set of problems."
In interviews at TDWI Chicago, representatives from IBM Corp. emphasized a selection of "key" issues -- including governance, fault-tolerance, and security -- to which, they argued, big data technologies (in the form of open source software projects or big data start-ups) typically give short shrift. "Ours [i.e., vision] is what we call the 'Big Data Platform,' [which is] a combination of technologies working together with integration and horizontal capabilities across those technologies. We emphasize information governance and security as [a] key [part of this]. We feel governance and security are lacking in most [offerings] on the market," said Nancy Kopp-Hensley, program director for Netezza product marketing.
IBM didn't announce any new product releases at TDWI Chicago; in early April, however, Big Blue announced an upcoming PureData appliance for Hadoop. That product, slated to ship in August, will support several Hadoop use cases, including online, interactive SQL querying. (This casts Hadoop as a queryable archive.) In this respect, Kopp-Hensley explains, the PureData appliance for Hadoop could be compared with existing technologies, such as Hive or Teradata Inc.'s SQL-H (which permits SQL-like querying of Hadoop) -- with the exception that it supports ANSI-compliant SQL queries.
The caveat is that IBM maintains its own proprietary implementation of Hadoop, dubbed BigInsights. There's a reason for this, according to Kopp-Hensley. "It's interesting to watch this [Hadoop] market sort of evolve. You start to see products show up before people understand what to do with them, or [before] they understand what's involved in using them."
Now that enterprises are actually grappling with Hadoop and big data, they're hitting a wall, she argued: IT policies require that data center solutions address issues such as security, fault tolerance, and governance. Big data upstarts are just beginning to tackle these problems, however.
"What we're saying is that they can bring this [Hadoop, via BigInsights] capability in to enhance their existing ecosystem, [and that] we'll deliver value by focusing on some of the things [open source Hadoop] lacks. In our Enterprise Edition of BigInsights, for example, we have high-availability capabilities. Most of the distributions now are starting to think about the value-add that needs to go back into these things, like workload management, high availability, security, and governance. We came out of the gates with BigInsights and we said, 'These things are important.' It was sort of quiet for awhile, then people built [Hadoop] clusters, and they said, 'What the heck do I do about [these requirements]?'"
Bringing It All Back Home
Elsewhere at TDWI Chicago, several market trailblazers followed Facebook's lead, making peace with -- or, at least, nominally embracing -- existing technologies.
Teradata used the event to announce its "Teradata Intelligent Memory" option, which will be available in June for its Teradata Database platform.
Is Intelligent Memory a come-to-in-memory-computing moment for Teradata? Not really.
By the textbook definition (in which both the indexes and contents of a database are loaded into and run entirely from memory) Intelligent Memory isn't an in-memory computing technology. By the prevailing (looser) standards of other so-called "in-memory" technologies, however, Intelligent Memory fits the bill.
Teradata officials, for the record, emphatically reject the term "in-memory"; instead, they position Intelligent Memory as a pragmatic alternative to a textbook in-memory implementation, which they say is unworkable at Teradata scale.
"'In-memory' means you put everything in memory. That really should be the appropriate industry definition -- [i.e.] that your entire database fits in memory. That's where we draw the line," said Sam Tawfik, product marketing manager with Teradata, in a briefing.
"We don't think you're going to be able to fit all of your data in memory because your data will grow faster than the size of your memory, especially in the Teradata environment, [where] we already have customers who have terabyte tables and multi-petabyte databases."
ParAccel, which was acquired in late April by Actian Inc., touted its own in-memory option; it claimed that Yahoo Inc. (a prominent customer reference) runs on a combined 40 TB of in-memory capacity. There's disagreement as to just what kind of "in-memory" technology ParAccel is, however. Officials claim the ParAccel engine can pre-cache the contents of a database into memory, but at least one attendee questioned whether ParAccel effectively optimizes how it uses memory, a la Kognitio or SAP AG with HANA.
In-memory aside, representatives were promoting version 5.0 of the ParAccel database engine, which is slated to ship in June.
There's a sense in which ParAccel 5.0 is something of a come-to-existing-database-technology release, too. Rival Teradata has long competed on the basis of its workload management chops, promoting its "Active Workload Management" capabilities as a competitive differentiator, particularly with respect to upstart analytic database platforms -- such as ParAccel. At the TDWI conference, ParAccel officials promoted an AWM acronym of their own -- namely, "Adaptive Workload Management," a key component of ParAccel 5.0.
"The way that we've implemented Adaptive Workload Management is that we [permit an administrator to] define 'Resource groups,' [which are] based on departments, users, applications, or query types. [An administrator] can allocate overall resources to those different groups [by assigning] a relative weighting, and this makes it easier to add [additional workloads]," explained Hannah Smalltree, director of product marketing with ParAccel. "When [new users come online], our Adaptive Workload Management automatically normalizes the [new] workloads to come back to this [weighted] state."
ParAccel 5.0 will ship with a significantly revamped query optimizer. "Our query optimizer is really one of the things that makes ParAccel such a high-performance platform. With our Omne Query Optimizer [new to ParAccel 5.0], we have gone from having one optimizer to having a framework of multiple optimizers," Smalltree said, explaining that ParAccel 5.0 implements five different optimizers. Because its Omne Query Optimizer leverages a "modular" framework, Smalltree claimed, ParAccel plans to extend it (namely, by adding modules) to support new or different types of queries.
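The weighted resource groups that Smalltree describes for Adaptive Workload Management can be pictured with a short sketch. This is only an illustration of relative weighting in general, not ParAccel's implementation: the function name `resource_shares`, the group names, and the weights are all hypothetical, and the sketch assumes that "normalizing" simply means rescaling each group's weight against the new total.

```python
def resource_shares(weights):
    """Convert relative group weights into fractional shares that sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {group: weight / total for group, weight in weights.items()}


# Hypothetical resource groups and relative weights (not taken from ParAccel).
weights = {"finance_reports": 3, "marketing_adhoc": 1}
before = resource_shares(weights)

# A new workload comes online; every share is recomputed against the new total.
weights["data_science"] = 4
after = resource_shares(weights)

print(before)
print(after)
```

Because the weights are relative, adding a group does not require an administrator to re-tune the existing ones; their proportions to one another stay the same while their absolute shares shrink.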
Big Data at the Bleeding Edge
The events of TDWI Chicago also highlighted emerging -- and potentially disruptive -- trends.
Take vendors Solace Systems Inc. and VelociData Inc., which shared an exhibit booth.
Both companies focus on what might be called emerging emerging big data use cases. For example, both market black-box hardware appliances. In the DM space, the term "appliance" is typically used to signify some combination of industry-standard hardware and software "special sauce," although some vendors -- e.g., the former Netezza (with its PowerPC-based Snippet Processing Units) and the former Kickfire (with its FPGA accelerator technology) -- have marketed appliances outfitted with specialty or proprietary hardware.
That being said, "appliances" as marketed by Cisco Systems Inc., Hewlett-Packard Co., IBM Corp., Oracle Corp., and Teradata tend to be highly proprietary, inasmuch as they're based on vendor-specific blade or rack technologies and ship with proprietary management tooling, middleware, and other amenities.
The archetypal DW appliances -- e.g., analytic database appliances from Aster Data Systems, Greenplum Software, DATAllegro, Dataupia, Kognitio, ParAccel, and Vertica -- all matched industry-standard hardware with their own software special sauce.
VelociData and Solace Systems, on the other hand, market a type of "appliance" that uses proprietary silicon, based on field-programmable gate array (FPGA) technology.
When VelociData says its appliances "accelerate" data integration (DI), it isn't talking about process or methodological improvements (e.g., agile data warehousing, ELT/ETL bypass, or data virtualization, among others). No, VelociData is talking about hardware acceleration. The focus is on doing DI at big data scale: VelociData says its DI appliances support sustained throughput of up to 10 Gbps. In some cases, VelociData claims, customers can incorporate its DI appliance into existing ETL workflows, where it can be used to perform ETL offload and acceleration. VelociData claims to support a slew of data quality-oriented features, too, including data cleansing and validation; address standardization; and binning and classification. It also markets an appliance for text mining.
Solace Systems markets hardware appliances that accelerate streaming message traffic. This, too, is a just-emerging big data use case: today, its biggest customers tend to cluster in the (bleeding edge) financial services and telecommunications verticals, says Bill Romano, a senior systems engineer with Solace Systems. "We aren't a big data solution, as most people think about big data today. Big data is storage, big data is analytics, big data [in this context] is big data at rest. It's people analyzing events that have already happened. Solace is a messaging appliance; it's really about data in motion," Romano explained.
Solace has several partnerships with prominent DM vendors, including SAS Institute Inc. and StreamBase Systems Inc. "We work with StreamBase, we work with SAS, specifically around their event-stream processing tools. Event-stream processors take data and move [it]; the only thing we do to the data is route it based on the 'topic,'" Romano continued, explaining that the Solace appliance tags messages with certain pre-determined "topics" and routes them accordingly. "We carry everything from stock market updates about trades to frames of video. Our hardware [underpinnings] gives us quite a bit more speed [than conventional message buses]; we do millions of messages per second."
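The topic-based routing Romano describes can be sketched in a few lines of software. This is a minimal illustration of the general publish/subscribe idea, assuming exact-match topic names; the `TopicRouter` class, its methods, and the topic strings are hypothetical and are not Solace's API, and the sketch says nothing about how the hardware appliance achieves its speed.

```python
from collections import defaultdict


class TopicRouter:
    """Toy in-process router: delivers each message to handlers of its topic."""

    def __init__(self):
        # Maps a topic name to the list of handlers subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        """Route payload by topic only; return how many handlers received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


router = TopicRouter()
trades, video = [], []
router.subscribe("market/trades", trades.append)
router.subscribe("video/frames", video.append)

# The router never inspects the payload; it only looks at the topic tag.
router.publish("market/trades", {"symbol": "XYZ", "price": 10.5})
router.publish("video/frames", b"\x00\x01")
```

The payload can be anything, from a trade update to a frame of video, because the routing decision depends solely on the topic.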