Big Data -- Why the 3Vs Just Don't Make Sense

The "big" in "big data" is a function of the volume, variety, and velocity of the information that constitutes it. If you read a dozen articles on big data, there's a good chance the 3 Vs -- volume, variety, and velocity -- will be cited in at least half of them.

This strikes many industry veterans as wrong-headed because if big data is understood solely on the basis of these trends, it isn't clear that it's at all hype-worthy. It isn't clear, in other words, that what we mean by "big data" constitutes a distinct departure from the data management (DM) status quo.

The confluence or collision of the 3V trend lines certainly isn't a new idea: Gartner Inc. analyst Doug Laney first codified this concept -- using the same 3 Vs -- over a decade ago while an analyst with (the now-defunct) META Group. According to a recent report authored by Philip Russom, director of research with The Data Warehousing Institute (TDWI), the volume characteristics that we today associate with big data arguably date back to "the early 2000s, [when] storage and CPU technologies were overwhelmed by the numerous terabytes of big data." Enterprises have been dealing with big data and its 3 Vs for a decade now, and the practices they've used to do so aren't necessarily "broken."

They might be sub-optimal or inelegant; they might even be outmoded, but if "big data" simply describes the volume, variety, and velocity of the information that constitutes it, our existing data management practices are still arguably up to the task.

There's a disconnect here, however. In practice, big data is hyped on the basis of its real or imagined outputs: the breathtaking possibilities of big data analytics, the unearthing of dramatic new insights, and the intelligibility, coherence, and sensibility that we believe to be embedded in the dizzying volumes, varieties, and velocities of our data.

A focus on volumes, varieties, and velocities, some say, misses the point.

"Big data is a lot more interesting when you bring in 'V' for value," grouses Michael Whitehead, CEO of data integration (DI) specialist WhereScape Inc. "Does new data enable an organization to get more value, and are we doing enough to get to that value quickly?"

In fact, says Whitehead, the big data 3 Vs help to perpetuate the stereotype of the navel-gazing IT-type who Just Doesn't Get It.

"No wonder IT people are seen as being disconnected from our user community, when we describe the problem based on its inputs not [its] outputs," he concludes.

Robert Eve, executive vice-president of marketing for data virtualization (DV) specialist Composite Software Inc., echoes Whitehead's assessment. "Let's face it, our industry simply loves acronyms … and numbers," he comments, citing ERP, EPM, CRM, and Web 2.0 as examples in kind. "Unfortunately, this time we used the right acronym, but the wrong Vs."

Eve, like Whitehead, emphasizes the importance of value in any big data calculus.

"Shouldn't the first 'V' be the business value derived from analyzing big data? Might we be better served if the second 'V' [were] the vision required to successfully synthesize business needs, analytic capabilities and source data?" he asks, concluding with a plug for Composite's own take on big data. "[G]iven all the complexity involved at every large enterprise today, perhaps the third 'V' should be the virtualization required to simplify and accelerate these efforts."

Value-Driven

Eve's point about the inevitability of the acronym is a good one. It's what most vexes industry veteran Marc Demarest, a principal with Noumenal Inc., a management consultancy that specializes in information technology, biotechnology, nanotechnology, and related disciplines. "The DW/BI industry has always suffered from terminological overload and terminological vacuity," says Demarest, who concedes that, absent context, the 3 Vs of big data partake of both overload and vacuity. Still, as acronyms go, "3 Vs" isn't as bad as it could be, Demarest points out.

"The 3V model is better than what we've had before, as it gets people focused on (some of) what's changing," he argues, noting that data volumes are increasing, while data volatility -- understood as a function of the increasing diversification and acceleration of data -- poses a growing challenge. To the extent that a handy acronym such as "3V" helps make what's happening more intelligible to non-technical folks, that's good, he avers.

It likewise opens itself up to other missing (but no less alliterative) Vs, says Demarest, who cites value, verifiability, variability -- which he distinguishes from variety -- and a host of other potential "V" attributes.

"Is this V model really a good way for any company to think about its big data challenge?" he asks. "I think the answer to that is really 'no.' The V model is a way to normalize, within bounds, an industrywide discourse, but it doesn't add much value -- V! -- if you are trying to figure out how your organization should approach big data."

Revisiting -- and Recasting -- the Problem

Demarest and Mark Madsen, a principal with information management consultancy Third Nature Inc., periodically co-present a TDWI seminar in which each takes a pro/con position with respect to big data. The idea is that each starts with an extremist or over-determined position; by seminar's end, both arrive at a kind of pragmatic common ground.

One aspect of this common ground is the idea that an organization either survives or thrives on the basis of what it does -- or doesn't do -- with its data.

That's the test that Demarest proposes for big-data-as-a-problem: not volume, variety, or velocity. If an organization treats its data as decisive to whether it survives or thrives, then it has a big data "problem." If it doesn't, then big data isn't a "problem" for it.

That's the test. The 3 Vs don't factor into it. Neither do any number of Vs, for that matter.

Demarest suggests a pair of contrasting cases: if a company treats its data as an asset that might confer competitive advantage, but that isn't integral or necessary to compete, it doesn't have a big data problem. The tools, practices, policies, and methods it uses to manage its data assets will likely continue to stand it in good stead, he says.

This case also helps shed some light on the inadequacy of the 3 Vs: an organization that falls into this category could be -- and probably is -- dealing with growing volumes, varieties, and velocities; because of the way it uses its data, however, it doesn't have a big data problem.

If, on the other hand, a company treats its data like an essential competitive asset -- as something without which it either cannot compete or cannot exist -- it has a big data problem, Demarest argues. Its big data problem has little to do with increasing volumes, varieties, and velocities: it's (rather) a function of what this organization wants to do with its data.

Volume, variety, and velocity might make problematic what an enterprise wants to do, but these three are only a few of the problem variables. Here, again, the 3 Vs are insufficient.

Third Nature's Madsen, for his part, says he mostly agrees with this formulation, at least as a litmus test for big data problems. He's less sanguine than is Demarest about the 3 Vs, however.

"The 3 Vs are attached to big data as if they contain explanatory power and [an] ability to tie a product to one or more 'values' of the esteemed product in question. I'd say this is partially true at best," he observes. Madsen's biggest beef with the 3 Vs is that they're usually adduced as Something New.

"I see the 3 Vs being used to explain change over time or a break from the past: there's more, it's faster, and it's different," he points out. "As [Gartner's] Laney [has] pointed out, he wrote that stuff a decade or so ago, and was simply writing about the burgeoning data in the data warehousing and business intelligence world."

Madsen uses the example of "expanding" or of "exploding" data volumes, which are usually advanced as Exhibit A in any discussion of big data and its 3 Vs. Trouble is, they've been consistently and predictably "expanding" (or "exploding," "skyrocketing," and so on) for years.

In the present case, he argues, data volumes are (if anything) expanding in a different context: that of Web 2.0, or the collaborative/social world. To some extent, the response to this involves reprising or retrofitting practices -- i.e., technologies, concepts, and methods -- that were developed to manage traditional (relational, hierarchical, semi-structured) data sources.

That's why we're seeing the DBMS-ification of Hadoop, Madsen concludes, somewhat tongue in cheek. "[Hadoop is] re-evolving exactly like the DB industry: Schema? Check. Indexes? Check. Catalog? Check. Random lookup? Check. SQL-like interface? Check. And so it goes."