Big Data and Meaningful Storage Metrics
Big data has the potential to alter the calculus by which data management groups buy, manage, and structure information storage.
- By Stephen Swoyer
- June 18, 2013
Big data isn't just a force for technological transformation -- it is, or could be, an active force for economic transformation, too. For example, argues Philip Russom, director of research in data management with TDWI Research, big data has the potential to alter the calculus by which data management (DM) groups buy, manage, and structure information storage.
Most groups just don't know it yet.
"I think, as we find a place for Hadoop in the data warehouse architecture, we really ought to revisit the economics, especially using a metric such as dollars-per-TB," Russom argues.
One issue, he concedes, is that a dollars-per-TB metric is still relatively fuzzy.
"A lot of times, people get kind of nervous and say, 'We can't do that. We've tried to come up with a super-accurate metric of dollars-per-TB and don't ever feel that it's accurate enough," Russom explains. "The reality is that most metrics are fuzzy, so just let go of the idea of having perfect metrics. I think that in this case, it's better to have a metric even if it's fuzzy -- or less than optimal --than not to have a metric of any kind at all."
Metrics were also a theme in a keynote address at last month's TDWI World Conference in Chicago. Ken Rudin -- head of analytics for social media powerhouse Facebook -- distinguished between metric sufficiency and metric perfection. In too many cases, Rudin noted, organizations become consumed with developing just the right metric. One upshot is that metrics simply don't get used -- or (as a function of the cost and effort it takes to "perfect" them) don't get used as effectively as they could be.
"[M]aybe you can't do a perfectly statistically controlled A/B test, but ... there's always some way ... to figure out how we've improved versus historical trends," Rudin observed, invoking the experimental spirit of analysis. "[I]t's the spirit of the experimentation versus the actual statistical significance of it that actually makes all of the difference. The spirit is, I don't sit there and say, 'Should we do A or should we do B? Let's make a decision.'
"The spirit is, 'Let's [use our metrics to] narrow it down' ... instead of saying, 'I know [what to do] and we're going to do this,'" Rudin urged. "I think that only works for Steve Jobs."
A Meaningful Metric
As a metric for assessing the cost of storage, dollars-per-TB is both measurable and intelligible. "I think most people will agree without really running the numbers that the most expensive [platform in terms of] dollars-per-TB would be a traditional DBMS-based data warehouse," Russom says. "At the other end of the economic spectrum, if we look at the price-per-TB of the Hadoop distributed file system (or HDFS), at least until it gets above the 200-server mark, it's really the most affordable storage for certain kinds of data. Somewhere between those two economic extremes would be mid-priced data platforms like data warehouse appliances and columnar databases."
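Russom's ranking boils down to simple arithmetic. In the sketch below, the platform costs and capacities are invented placeholder figures -- not numbers from Russom, TDWI, or any vendor -- used only to show how a dollars-per-TB comparison might be computed.

```python
# Illustrative sketch: costs and capacities are invented placeholders,
# not real pricing for any product.
platforms = {
    "Traditional DBMS data warehouse": {"cost_usd": 2_000_000, "capacity_tb": 50},
    "Data warehouse appliance": {"cost_usd": 1_000_000, "capacity_tb": 100},
    "Columnar database": {"cost_usd": 600_000, "capacity_tb": 100},
    "HDFS cluster (under 200 servers)": {"cost_usd": 300_000, "capacity_tb": 500},
}

def dollars_per_tb(cost_usd, capacity_tb):
    """The bare-bones metric: total platform cost divided by usable terabytes."""
    return cost_usd / capacity_tb

# Rank platforms from cheapest to most expensive storage.
ranked = sorted(platforms.items(), key=lambda item: dollars_per_tb(**item[1]))
for name, p in ranked:
    print(f"{name}: ${dollars_per_tb(**p):,.0f} per TB")
```

Even with made-up inputs, the structure of the comparison holds: whatever the real figures are, dividing total cost by usable capacity puts every platform on the same axis.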
He cites detail data as a case in point. As a function of its newfound value -- for analytic discovery, investigative computing, and other practices -- a growing number of DM groups now opt to persist detail data.
"In the old days, we'd extract operational data from ERP, CRM, or other systems and our ETL would work on this data in the middle of the night, take that [conformed and prepared] data as a result, load it into the data warehouse … and then delete the [source] data from the staging area," Russom explains.
"Nowadays, we're keeping a huge store of detail data in storage volumes we've never seen before, because more organizations are moving into analytic technologies that actually work best off of raw source data, as opposed to that squeaky-clean data that we're used to loading into the warehouse."
The worst place to keep this data is in a data warehouse or operational data store (ODS), Russom notes. This is precisely what many DM teams are doing, however.
"Many organizations are storing that detailed source data ... on the expensive data warehouse itself. It would be a lot more cost-effective to just keep it on HDFS," he says, noting that this is, in fact, what some DM teams have started to do. "For much of the analytic processing we do with detailed source data, HDFS can also be a favorable choice in terms of matching a workload to a platform best suited to it."
"One use of the traditional ODS has been as a large storage area for raw data. Again, Hadoop is extraordinarily scalable with that kind of data, assuming it's in files, which a lot of it is," Russom points out, referring to flat files, CSV files, or similar file-based extracts from operational systems. "We're seeing some organizations prototyping with HDFS as a bigger and better operational data store for the permanent storage of detailed source data."
In this case, the calculus is relatively straightforward. In others -- particularly as regards the cost of a dedicated analytic platform versus that of a traditional data warehouse for certain kinds of storage or workloads -- it can get fuzzier, Russom concedes.
On the other hand, he's skeptical of attempts to systematize the problem: there likely isn't a "golden algorithm" for figuring this out. "If we've learned anything from business intelligence and data warehousing, it's that every organization is very different in terms of the collection of sources that they have, in-house skills, deployed platforms, what the organization wants to do with its data, and so on," he explains.
"I would encourage organizations to take this [dollars-per-TB] as a bare-bones metric and create their own algorithms out of that. If you spend too much time trying to create the 'perfect' algorithm, you'll just bog down and never get to the actual goal of thinking about where you put data and where you process data, [as a function of cost]."