What Big Data Is Really About
Ignore the hype surrounding big data. What's really important is to learn about the new models for data processing that big data is bringing so you can plan rather than react.
- By Mark Madsen
- January 22, 2013
[Editor's note: Mark Madsen is leading several sessions at the TDWI World Conference in Las Vegas, February 17-22, 2013.]
Big data isn't hype, but it is being hyped. There is substance to the technology shift happening in the broader data management market of which both business intelligence and big data are a part. The real question to ask is "what's different?"
The constant drone of the "three Vs of big data" we keep hearing in the media doesn't explain much. That framing takes big data literally, explaining it in terms of bigness (except when the data isn't big), variety (except when there isn't any variety in the data), and velocity (except when the data is processed in batch). You can have big data without any of the three Vs, which makes this an empty definition.
Big data implies big, but is it? Many people are using the technologies to process moderate volumes of data, perhaps in the same way we use ETL. Nor is "big" necessarily a synonym for unstructured data: much of that data is structured. It might be log files, but those logs are events, generally easy to map to a relational table. It may be text, variably structured data, or the simple rows and columns we're used to.
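To make that concrete, here is a minimal sketch (in Python) of how one log event maps onto a flat, relational-style row. The Apache-style log format and the column names are assumptions made for the example, not anything prescribed by a particular tool:

```python
import re
from datetime import datetime

# Illustrative pattern for an Apache-style access log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line: str) -> dict:
    """Turn one log line (an event) into a row of named, typed columns."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError("unparseable line: " + line)
    return {
        "ip": m.group("ip"),
        "event_time": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
        "bytes": 0 if m.group("bytes") == "-" else int(m.group("bytes")),
    }

sample = '203.0.113.9 - - [22/Jan/2013:10:15:32 -0800] "GET /pricing HTTP/1.1" 200 5123'
print(parse_line(sample))  # one event becomes one relational-style row
```

The point is simply that event data like this already has a natural tabular shape; what makes it "big data" is the volume and the processing model, not some exotic structure.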
The term implies that the shift is about data, but it's equally about technology. One assumption is that big data technology equals Hadoop. It's more than Hadoop. There are real-time data stores, processing and analytics engines, and streaming technologies for monitoring and processing data as it flows. Some are built on top of Hadoop or HDFS while others exist independently. What they usually share is an ability to be deployed in dynamically scalable configurations.
The reality is that big data is about new models for data processing. It isn't some specific type of data, huge volume of data, or specific technology. It's about applying new technologies to meet unfulfilled needs that (usually) can't be met by the traditional data warehouse architecture.
The areas where a data warehouse has difficulty are analytic processing, some types of data processing and transformation, and timeliness of development.
Some analytic processing is possible in SQL. Because of this, many database and analytic tool vendors say there's no need to change, or that the answer is a more scalable database. Depending on the scale, the types of data, the user concurrency, and the algorithm, this may be true. It's equally possible that one of these elements limits the use of a database, pushing the data warehouse to the side as the source of data that must be moved, transformed, and processed elsewhere.
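As a rough illustration, the toy sketch below uses Python's built-in sqlite3 with a made-up orders table. It shows the kind of set-oriented aggregation that sits comfortably inside SQL, and the closing comment notes the kind of iterative work that tends to get pulled out of the database:

```python
import sqlite3

# Toy example: a daily revenue aggregate over an illustrative orders table.
# The table and column names are assumptions for this sketch, not a reference schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("2013-01-20", "west", 120.0), ("2013-01-20", "east", 80.0),
     ("2013-01-21", "west", 200.0), ("2013-01-21", "east", 50.0)],
)

# Set-oriented aggregation like this is exactly what the database is good at.
for row in conn.execute(
    "SELECT order_day, region, SUM(amount) AS revenue "
    "FROM orders GROUP BY order_day, region ORDER BY order_day, region"
):
    print(row)

# By contrast, an iterative algorithm (say, clustering customers over many
# passes of the data) is awkward to express in SQL alone; at some scale or
# complexity the data gets extracted and processed outside the warehouse.
```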
There are areas of basic data processing that the data warehouse technology stack has trouble with. This is less a failure of the database than of the data integration tools and the architecture. At large scale, processing becomes slow or expensive (or both). If the data is text or has a complex structure, the DI tools may be poorly suited to the work. We end up in a situation where both the processing and the storage are a poor match for the tools we have available.
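For example, here is a small sketch of the sort of transformation that is awkward in row-and-column DI tools but trivial in code: flattening a variably structured JSON event into relational-style rows. The event shape and field names are invented for the example:

```python
import json

# An illustrative event with nested structure and optional fields --
# the kind of record a strictly row-and-column mapping handles poorly.
event = json.loads("""
{
  "session": "abc123",
  "user": {"id": 42, "country": "US"},
  "actions": [
    {"type": "view", "page": "/home"},
    {"type": "click", "page": "/pricing", "element": "buy-button"}
  ]
}
""")

# Flatten the nested actions into one row per action.
rows = []
for action in event.get("actions", []):
    rows.append({
        "session": event["session"],
        "user_id": event["user"]["id"],
        "country": event["user"].get("country"),
        "action_type": action["type"],
        "page": action.get("page"),
        "element": action.get("element"),  # missing fields simply become nulls
    })

for r in rows:
    print(r)
```

In code, handling nesting, missing fields, and changing shapes is a few lines; in a mapping-driven DI tool it often means reworking the mapping for every variation.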
A constraint on the data warehouse today is its lack of agility: slow response to change, and to the need for rapidly cycling models and uses, as with much exploratory or experimental analytics. "Experimental" isn't restricted to scientific processing. A/B or multivariate testing of landing pages on a Web site is a form of experimentation. So are test marketing campaigns and staged product launches. It's a style of using information in which the data, models, and integrations need to be changed and updated rapidly -- something the data warehouse was not designed for. It was designed for predictable, or at least bounded, uses that didn't change significantly.
This constraint is baked into the architecture. Models are static in the database, and both data integration and BI tools maintain mappings to the static database model. Each layer in the architecture is managed independently, usually by different people. The capabilities in one layer are generally not available to tools in the other layers. Any change requires coordination from top to bottom. All of this, along with the need to model in advance, constrains the speed with which a data warehouse can react to change.
A data warehouse is good for storing the important data, and for delivery of information when the usage model is interactive query, as with BI tools and dashboards. It's less well suited to the read-write activities of exploratory analysis, the data-intensive processing of analytic models, and high-volume, low-latency, real-time workloads.
These are the usage models that various big data technologies were designed to address. They are designed more for these unmet needs than they are for the conventional workloads of BI. Because of this, they lack key features such as robust query support and good data management, to the point of sometimes lacking concepts such as metadata and schemas that are taken for granted in the data warehouse world. The features are missing because they aren't as important for non-BI uses.
Big data is about data processing and new usage models. As an architect or designer, it's helpful to look at what's different about the technologies available, the data being processed, and the uses. Big data isn't a replacement for data warehousing, nor is it an island to be maintained separately. It's part of the new IT environment in the same way data warehouses and BI were a new addition to the IT environment when there was only OLTP and batch reporting. We're still in the early stages of the market, making today the time to learn about the changes that are coming. Otherwise, you'll be reacting to changes instead of planning for them.