Q&A: Big Data through the Looking Glass -- Two Views on the Big Data Revolution
Is Big Data a revolution in information technology? Does the technology truly offer us the ability to do something new, or do we simply need to renovate the technologies we already have?
- By James E. Powell
- July 10, 2012
How has big data impacted how we conduct analysis, and what does this mean for BI professionals?
To learn more about current big data trends and thinking, we turned to Mark Madsen, president of Third Nature, Inc., a technology consulting and market research firm focused on BI, data integration, and data management. Mark is presenting the keynote address, Big Data through the Looking Glass, with Marc Demarest at the TDWI World Conference in San Diego, July 29 - August 3, 2012.
BI This Week: The title of your session refers to two views on big data. What are the views you're talking about?
Mark Madsen: When you look at the market today, you see two opposing views. Both take a few assumptions as given, such as fast-growing data volumes, performance problems serving up that data, and the need for more and better use of the data. Where they differ is that each sees the other's solution to these problems as wrong.
On the advocate side, you have people claiming that new technology sweeps away everything that came before it -- a revolution in information technology. This is what you hear most as the hype in the market grows.
You don't buy into it?
Not to the extent they're talking about. The marketing from some of the vendors in this space is completely over the top, bordering on ludicrous. There are some interesting aspects to the technology change, and I certainly agree with them that we're seeing more sophisticated use of data by people, as well as increasingly automated use of machine learning and statistics to drive online systems directly.
I think that's part of the reason for the hype. The organizations that received the most value early on were largely in the Web application and social media spaces. They had needs for data and user scalability that dwarf what was required by most conventional IT groups. This meant they had to develop new approaches and technology. They're the poster children for the big data market. I don't think they would have been successful with the old architectures.
The other problem I have is the way the analytics market has collided with the big data market. Much of what we're talking about is actually analytics with lots of data, driven by new business models or the analysis of people's behavior. This created a mythology that the vendors promote, that a person working alone with a mass of data will discover key insights that can change the business -- probably true in a startup fundamentally built around data. The question is how applicable their solutions are to a broader IT market.
Can people gain those insights in other types of organizations?
It's not that so much as that the Web and social companies are dealing with data of a different complexity. These new organizations are usually dealing with relatively few sources of simple data, not a complex supply chain. The fewer, simpler event streams also require different sorts of analysis -- the signal-to-noise ratio in the data is much worse.
However, that's not the biggest problem with applicability. It's this gold-rush myth, that by mining a vast pile of data you'll surface some nugget. You might very well do that, but getting that insight to others in the organization, convincing them that the data is right, that the interpretation of the data is right, determining an action to take, promoting that action, making a decision, and then taking the action is not a single-person task. It's multiple people in different parts of the organization and at different levels. It's the equivalent of the long trip to get your gold from the mountains to a town via pack mule without being waylaid along the way. Analytics serves a valuable function, but the product of analytics still fits within organizational, technology, and social contexts.
Is this the other view you talk about?
Partly. The opposing view is that this is a bunch of technology vendors riding the edge of another Internet bubble, that the technology doesn't offer anything we can't do already, and that we don't need to change so much as renovate what we've already got.
Do you believe that?
No, not really. I think there's an element of truth to it, though. The hype is there, and the vendors are scrambling for customers because there are only so many Web startups to sell to, so they keep pumping the hype machine.
There's also the aspect of what fits with IT. I think that's a big gap for many of these vendors, which grew out of high-tech companies with their own development staff. Most custom development in conventional IT is done by consultants and outsourcers. The new technologies are pretty raw and there's a skills gap, so they're hard to deploy and manage compared with established databases and analytics tools such as SAS or SPSS.
The technology context is big, too. You might use this new tech to process data or generate insights, but how do you get data to it and, more important, how do you do some of the things that are needed such as publishing, displaying, and sharing? We need to marry the data management and delivery capabilities we have today with the new technologies. I see that as the real work ahead.
Are the analytical problems posed by big data truly new? Are they unaddressable or insurmountable using existing tools?
Some of the problems are, yes. The analysis techniques are pretty much the same, but the scale of data is different, as are the types of data. In the past, we were often analyzing transactions as indicators of behavior. With the monitoring built into Web sites, applications, mobile devices -- just about everything today -- there are constant event streams of data that flow with and around the transactions.
The event streams flow in real time, and that's a problem for a model of information use built around extracting data from databases and files and loading it into read-only databases at some longer latency. The approaches and even the technologies we have in IT aren't up to the task of data collection, let alone analysis in real time or with large-scale batches.
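The gap Madsen describes here -- periodic bulk loads on one side, continuous event streams on the other -- can be sketched in a few lines. This is a minimal illustration only; the event shape and the `latency_ms` field are hypothetical, not anything from the interview:

```python
import statistics
from collections import deque

def batch_load(events):
    """Old model: extract and load everything first, analyze afterward."""
    store = list(events)  # the full load must finish before any answer exists
    return statistics.mean(e["latency_ms"] for e in store)

class StreamingMonitor:
    """Newer model: maintain an aggregate as each event arrives."""
    def __init__(self, window=100):
        self.window = deque(maxlen=window)  # sliding window of recent events

    def ingest(self, event):
        self.window.append(event["latency_ms"])
        return statistics.mean(self.window)  # a fresh answer on every event
```

With the batch model, the answer is only as fresh as the last load; the streaming monitor updates its aggregate per event, which is the low-latency collection the conventional extract-and-load pipeline wasn't built for.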
What's wrong with what we have? Can you give an example?
If you look at the parallel shared-nothing databases that appeared over the last ten years -- and which, by the way, have been mostly absorbed by the big vendors -- they were designed for this old world of data. They mostly perform best when data is bulk loaded and accessed afterward. That's not a real-time data flow monitoring solution, nor is it an analytic solution for high concurrency (for example, the algorithms behind a recommendation engine on a Web site). These are newer use cases for most of IT.
The newer analytic problems are built around analyzing event streams and the behavior of people, machines, and devices. We're not looking for simple correlation, but something different: causality. We want to answer the harder "why" questions. Teasing out causality in the more unpredictable data about behavior, and in less reliable and noisy event streams, is a harder problem to solve.
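The correlation-versus-causality point can be made concrete with a small simulation: a hidden confounder drives two variables that have no causal link between each other, yet they correlate strongly. The variable names are purely illustrative:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
# Hidden confounder z drives both x and y; x does not cause y.
z = [random.gauss(0, 1) for _ in range(5000)]
x = [zi + random.gauss(0, 0.3) for zi in z]
y = [zi + random.gauss(0, 0.3) for zi in z]
r = pearson(x, y)  # close to 1, even though intervening on x would not move y
```

A pipeline that surfaces this correlation has answered "what," not "why" -- which is exactly the gap between correlation-hunting over big data and the causal questions Madsen points to.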
Add to that the data that is no longer directly under IT control, coming from SaaS applications. External data, syndicated market data, data from multiple versions of different devices -- these are all new data sources. We have ETL tools that can handle any file or database, yet they can't talk the protocols that drive the new software applications in the cloud or the messaging protocols used by devices.
It sounds like this changes some of our fundamental assumptions about things such as a data warehouse. We've been telling everyone for 20 years that they need one. Doesn't it place into question the role -- or the viability -- of the enterprise data warehouse, in particular?
I think it does, to an extent, but not so much the data warehouse as what we use it for. The old model had this single data warehouse holding all of an organization's data in a clean and orderly model. That turned out to be impractical. Clinging to that model, as some of the naysayers think we need to, jeopardizes the relevance of the BI and analytics groups that exist today.
What we're facing is an architectural shift. You can think of it like the Web 1.0 to Web 2.0 shift. Web 1.0 was largely read-only, publishing oriented, just like BI and the data warehouse. Web 2.0 is read-write, peer-to-peer at both a human and machine level. That was an architecture shift as much as a technology shift. We reconfigured the components in the architecture and created different ways to build and scale software.
The way we handle data is changing. A single database may still form the center for much of an organization's high-value data. However, we also need to find a way to deal with the new external and event data, which we don't use in the same way. It's sometimes of questionable value until someone works out a use or comes across a problem that was intractable before this data was around. You don't want huge volumes of noisy data in a data warehouse that's designed for reports, dashboards, and exploration. You want clean, better-understood data there.
We're in the midst of an architectural transition for data. We have the ability with technologies such as Hadoop and NoSQL databases to collect and store real-time streams, to monitor in near real-time, and to store in ways that allow us to apply models after the fact. Today, it's a very bottom-up effort to do anything in BI, with requirements, data modeling, ETL, and metadata layers to coordinate and build.
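The pattern Madsen describes -- collecting raw streams now and applying models after the fact -- is often called schema-on-read. A toy sketch, with hypothetical function names:

```python
import json

raw_log = []  # raw store: append events as-is, with no upfront schema or ETL

def collect(event_json):
    """Write path: keep the raw event; defer all modeling to later."""
    raw_log.append(event_json)

def apply_model(field):
    """Read path: impose structure at query time by parsing and projecting."""
    return [json.loads(e).get(field) for e in raw_log]

collect('{"user": "a", "clicks": 3}')
collect('{"user": "b", "clicks": 5, "referrer": "search"}')
```

Nothing was decided up front about `referrer`; it can still be extracted from events that happened to carry it. That deferral is the contrast with the bottom-up BI effort of requirements, data modeling, ETL, and metadata layers.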
In essence we're talking about an architecture that manages data as infrastructure. At that layer, we manage and position data for different uses in the organization, whether low or high latency, low or high volume. Above the infrastructure layer we have the technologies that consume it, and different applications might consume it very differently.
There are really two roles around data, and when I listen to big data advocates, they make the same mistake we made with the data warehouse. They want to design it like an application within a single stack, when it needs to be designed as a layer of infrastructure, like a water supply in a town, not another house with a private well.
We need to start separating the use of data from the mechanisms of delivery. The nature of modern Web architectures enables more fluid access to services. We need to build the same things for data, and the big data evolution is a part of that.
Does big data require a big commitment, such that you can't "do" big data in a piecemeal, iterative, or selective fashion?
I think it's an evolution, not a revolution. There are applicable areas for new technologies that offer different capabilities. We're still going to use what we have, but we may move some of the processing to different areas, or store data in different places for different uses.
There are use cases for almost any industry, much like there were use cases for data mining in almost every industry back in the 90s. Like then, the uses are often tied to a specific area of the business, or to specific data. Over time, we'll see them grow and evolve as infrastructure, much the same way executive information systems grew into OLAP and into the data warehouse infrastructure we have today.
We also can't throw out what we have. The big data infrastructure offers a lot in terms of flexibility -- and performance for some uses -- but it also has gaps. The biggest is data management. All that "big data" is great, but if you can't link it to your master reference data, then you can't link it, or the insights you generate, back to the rest of the business.
We need to embrace the new, but we also have to remember how and why we arrived at the state we're in today. There are good reasons for some of what we do, just as there are good reasons to change or abandon outmoded ways of doing the work we do. So to answer your question directly, no, the commitment can be small and approached in a selective fashion. It's how we started doing BI and data warehouse projects in the late 80s.