Q&A: Deflating Big Data Myths

Does the term big data still have meaning? Analyst and author Barry Devlin tackles that question and others about the popular technology.

Does the term big data still have meaning? Well-known analyst and author Barry Devlin tackles that question and others, including the often-overlooked privacy concerns big data raises. "The combined total of all the data available on the Web and in private databases is so large and so interrelated," Devlin says, "that privacy becomes well-nigh impossible."

A founder of the data warehousing industry, Devlin speaks and writes widely on the topic, including as an associate editor of TDWI's Business Intelligence Journal. He has 30 years of experience in IT as an architect, consultant, manager, and software evangelist. His company, 9sight Consulting, provides consulting services to clients worldwide. His newest book is due out later this year.

BI This Week: Let's begin with a point you made in a recent article for BeyeNetwork: Is there a difference between big data and just data, period?

Barry Devlin: I've struggled with the term "big data" since I first came across it over three years ago. The problem is that "big" is a completely subjective term. Similarly, all the "v" words -- and there are now so many of them, including velocity, variety, volume, veracity, and so forth -- are also entirely relative. How does a business know if they have a big data requirement, problem, or opportunity? The short answer is: they don't.

In this confusion, the market has been deluged with big hype from vendors (there's another one of those "v" words!) who are labeling existing products as big data or NoSQL, and bolting Hadoop onto just about every conceivable platform. I've recently seen IDMS described as one of the earliest NoSQL databases! What about IMS?

Then there's the misconception that big data equals Hadoop. Of course, this isn't true, irrespective of how you try to define "big data." Hadoop is just an open source, parallel programming environment for commodity hardware platforms.

So what are customers doing? We found in a joint 9sight/EMA survey last year that they seem to be declaring many projects as big data that they would have undertaken anyway under some other label. My guess is that this is often for internal justification needs -- big data is a topic the executives have seen in the business press as a "must have" for success, so it's easier to get budget approval for a project with "big data" in the title. (This might sound cynical, but many IT shops work this way.)

My overall take on all this is to step back and try to rationalize the situation. The term "big data" cannot easily be defined. Some vendors are attaching it to a wide range of products, while others equate it with Hadoop. Clearly, customers have already adopted and adapted the term for their own purposes.

If I could stop people from using the phrase, I would. Since I cannot, I have taken the position that "big data is all data," and I try to move the discussion to three data types that have better-defined characteristics: human-sourced information, machine-generated data, and process-mediated data. (See my white paper "Big Data Zoo" for more details on these three information domains.)

That helps to explain the following point, also from one of your recent blog postings: "Many so-called big data projects have more to do with more traditional data types, i.e., [are] relationally structured, but are bigger or require faster access."

When WalMart was building the world's largest data warehouse back in the 1990s, they failed to notice they were undertaking a big data project. Why was that? Because the term had not yet jumped the "species boundary" from the world of physical science, where it was first used. Today, many similarly scoped projects using large volumes of traditional, relational, process-mediated data are labeled big data.

If we have big data, do we also have big analytics?

What the hype around big data has done is to create focus around statistical analysis of large volumes of information. Those of us who have been around a while (since back in the mid-1990s!) remember when this was called data mining. What has happened, in my view, is that advances in technology (processing, memory, and storage) have enabled at least two things: (1) broader use of statistical analysis techniques because of a much lower barrier to entry and (2) a move from sample set to full-set analysis.

Combined with the mushrooming of social media data (which is a component of human-sourced information), the opportunities for analytics have grown enormously. Certainly, there is value to be found there.

My concern is that statistical analysis is somewhat of a "black art" in which it is extremely easy to draw invalid conclusions from the process if one lacks basic statistical training. The suggestion that business users can self-serve if the tools are simple enough is worrying. Self-service BI is far less demanding of user skills, and even there, we have seen cases where data is misused, or used to draw incorrect conclusions. Businesses must move forward carefully in this area. Big data / big analytics is, I often joke, spreadsheets on steroids.
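To make the risk of invalid conclusions concrete, here is a minimal sketch, not taken from the interview, of one common trap: test enough unrelated variables against a target and some will look "significant" by chance alone. Everything in it is an illustrative assumption, including the function names, the 20-point sample size, and the 0.444 cutoff (roughly the conventional two-tailed 5% critical value of Pearson's r for 20 observations).

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_points=20, n_features=1000, threshold=0.444, seed=0):
    """Count pure-noise columns whose correlation with a pure-noise
    target exceeds the threshold. None of them has any real relationship
    to the target, so every hit is a false discovery."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_features):
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(target, noise)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious(), "of 1000 pure-noise columns look 'significant'")
```

With a 5% cutoff, something on the order of 50 of the 1,000 noise columns would be expected to pass. An analyst without statistical training who reports only those hits has found nothing at all.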

On the positive side, the elevation of data scientists to guru status has driven universities to take notice and begin to train students in some of the basics of playing with data -- from how to gather and integrate it, through analysis techniques and limitations, to the ethical concerns around privacy and accountability, particularly around personal information.

Here's a basic question from your blog (in fact, I'm quoting you): "Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances?"

We do need to examine all these prior architectural and design decisions, which were made when technology was much less powerful, data sources were entirely internal, and business demands were simpler. The emerging "biz-tech ecosystem," as I call it, is a much more complex and deeply interconnected environment than existed in the 1980s, when data warehousing was invented. Some of the design decisions will still be valid, while others have outlived their usefulness. This is a topic I've been developing for a number of years now; it's a foundation for my new book "Business Unintelligence -- Via Analytics, Big Data, and Collaboration to Innovative Business Insight," which will be released later this year.

With all the talk about big data, there doesn't seem to be much attention paid to its downside and possible misuse. Can you discuss that?

Yes, this is a vital topic to which far too little attention is paid in the mainstream. Here are two quotes I use in my book:

Media that spies on and data-mines the public is destroying freedom of thought and only this generation, the last to grow up remembering the "old way," is positioned to save this, humanity's most precious freedom. -- Eben Moglen, professor of law and legal history at Columbia University and chairman of the Software Freedom Law Center, speaking at re:publica Berlin, May 2012

The Senate [passed] legislation ... granting the public the right to automatically display on their Facebook feeds what they're watching on Netflix. ... However, they [lawmakers] cut from the legislative package language requiring the authorities to get a warrant to read your e-mail or other data stored in the cloud. -- David Kravets, in Wired magazine, December 2012

The point here is that big data, in the sense of the combined total of all the data that is available on the Web and in private databases, is so large and so interrelated that privacy becomes well-nigh impossible. While researching my book, I came across two articles published in the New York Times in the same week in January 2013 that illustrate the point. The first describes how patient records -- transcribed and digitized from doctors' notes, made anonymous, and stored on the Web -- can be statistically mined to discover previously unknown side effects of, and interactions between, prescribed drugs. That's clearly useful and valuable work. The second article, three days later, revealed how easily a genetics researcher was able to identify five individuals and their extended families by combining publicly available information from the supposedly anonymous 1000 Genomes Project database, a commercial genealogy Web site, and Google.

The underlying genetic data is used in medical research to good effect, of course, but what are the possible consequences for the individuals thus identified? Certainly, we can imagine that insurance companies, governments, or other interested parties might well make potentially negative assessments based on their once-private genomes.
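The kind of re-identification described above is often called a linkage attack: two data sets that are each harmless alone are joined on shared attributes. The sketch below is a toy illustration only and does not reproduce the researcher's actual method. The records, field names, and the `link` helper are all hypothetical.

```python
# Hypothetical toy data: the study table carries no names, and the
# public table carries no study participation or medical detail.
anonymized_study = [
    {"participant": "P-01", "birth_year": 1950, "state": "UT", "marker": "haplotype-A"},
    {"participant": "P-02", "birth_year": 1962, "state": "OH", "marker": "haplotype-B"},
]
public_genealogy = [
    {"surname": "Example", "birth_year": 1950, "state": "UT", "marker": "haplotype-A"},
    {"surname": "Sample", "birth_year": 1971, "state": "TX", "marker": "haplotype-C"},
]


def link(records_a, records_b, keys):
    """Join two lists of dicts on the given keys, keeping only rows from
    records_a that match exactly one row in records_b (a unique match is
    what turns an 'anonymous' record into a named one)."""
    index = {}
    for rec in records_b:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    matches = []
    for rec in records_a:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:
            matches.append({**rec, **candidates[0]})
    return matches


reidentified = link(anonymized_study, public_genealogy,
                    keys=("marker", "birth_year", "state"))

if __name__ == "__main__":
    for row in reidentified:
        print(row["participant"], "is probably surname", row["surname"])
```

Removing names from the study table did not help here, because the combination of marker, birth year, and state was unique enough to act as an identifier in its own right.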

Let me end by saying that the old phrase caveat emptor -- buyer beware -- is in need of updating. You no longer have to buy anything to need to take care over what you are exposing of yourself. Your Google search history may be enough.
