Picking the Right Platform: Big Data or Traditional Warehouse?
Big data or the data warehouse? Pick the wrong platform for the wrong workload and you could find yourself racking up hundreds of thousands (or even millions) of dollars in extra costs.
- By Stephen Swoyer
- December 17, 2013
The recent Teradata Inc. Partners Users Group (Partners) conference broke new ground in many ways -- not least with the framing of a question. This came courtesy of industry luminary Richard Winter, who outlined his vision of "TCOD" -- or "total cost of [big] data."
When, asked Winter, is it more sensible to use a big data platform, and when is the traditional data warehouse (DW) the superior option? It's the kind of question that simply would not have been asked at past Partners events.
Winter, a pioneer in research and analysis of very large database (VLDB) platforms, said picking the wrong platform for the wrong job could rack up hundreds of thousands -- or hundreds of millions -- of dollars in unnecessary costs. In fact, he argued that misusing Hadoop for some types of decision support workloads could cost 2.8 times as much as a data warehouse.
"[This] cost is mainly in the complex queries and analytics. The problem is that it's much more expensive to create these queries in Java MapReduce than it is in a data warehouse technical environment," Winter said. "Each technology has its sweet spot, [and] each sweet spot delivers huge savings to the customer. However, if you get outside that sweet spot, it goes the other way."
TCOD Explained
Winter's papers on VLDB deployment, scale, and cost issues were required reading in the 1990s and 2000s. In his Partners presentation, he framed the issue with characteristic succinctness.
"Under what circumstances, in fact, does Hadoop save you a lot of money, and under what circumstances does a data warehouse save you a lot of money?" he asked, adding that TCOD, unlike other costing metrics, comprises a "complex cost estimating problem."
"If you want to look at the total cost of a project, total cost in the IT sense, what do you look at?"
It's in this respect, Winter argued, that traditional costing measures are inadequate. For example, a metric such as total cost of ownership (TCO) attempts to account for the (acquisition) cost of a system plus its ongoing maintenance. Missing from this is a slew of other costs, such as (with respect to decision support) the cost of developing and maintaining ETL, analytical applications, queries, and analytics -- along with the cost of upgrading the system over five years.
TCOD also accounts for the paradoxical cost of platform success: the more you give users (in terms of analytical applications, queries, or analytics), the more they'll want, so the cost of a successful decision support or analytical platform tends to increase over time. "Costs grow as users find ways to leverage [the] value of investment," Winter explained.
For this reason, he built a compound annual growth rate (CAGR) of 26 percent over five years into his TCOD calculus. Elsewhere, TCOD uses published list prices per terabyte, per system: for Hadoop, this is $1,000 per TB; for the data warehouse, Winter says he "averaged" the prices of three "widely used products" and also factored in the enterprise discount (usually 40 percent) that most vendors offer. Salary information for full-time employees (FTEs) was sourced from indeed.com; project costs (which also take into account lines of code contributed on a per-FTE basis) were sourced from qsm.com, which maintains a database of metrics derived from more than 10,000 completed software projects.
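To make the arithmetic concrete, here is a minimal sketch of how inputs like these might combine into a five-year TCOD figure. Only the $1,000-per-TB Hadoop price, the 40 percent enterprise discount, and the 26 percent CAGR come from Winter's talk; the data warehouse list price and the first-year development budgets below are hypothetical placeholders, not Winter's actual inputs.

```python
# A minimal sketch of TCOD-style arithmetic. Only the $1,000/TB Hadoop
# price, the 40 percent discount, and the 26 percent CAGR are Winter's;
# the DW list price and development budgets are hypothetical.

HADOOP_PRICE_PER_TB = 1_000        # published list price (Winter)
DW_LIST_PRICE_PER_TB = 10_000      # hypothetical "averaged" list price
ENTERPRISE_DISCOUNT = 0.40         # typical vendor discount (Winter)
CAGR = 0.26                        # annual growth in development work
YEARS = 5
TERABYTES = 500

def development_cost(first_year_budget: float) -> float:
    """Development spend compounding at the CAGR over five years."""
    return sum(first_year_budget * (1 + CAGR) ** y for y in range(YEARS))

# System (acquisition) cost: the only piece traditional TCO captures.
dw_system = DW_LIST_PRICE_PER_TB * (1 - ENTERPRISE_DISCOUNT) * TERABYTES
hadoop_system = HADOOP_PRICE_PER_TB * TERABYTES

# Hypothetical first-year budgets, reflecting Winter's point that
# queries and analytics cost far more to build in Java MapReduce.
dw_total = dw_system + development_cost(10_000_000)
hadoop_total = hadoop_system + development_cost(28_000_000)

print(f"DW 5-year TCOD:     ${dw_total:,.0f}")
print(f"Hadoop 5-year TCOD: ${hadoop_total:,.0f}")
```

With placeholder inputs like these, development spending dwarfs system cost on both sides within a few years -- which is precisely why, in Winter's telling, TCO alone is the wrong lens.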
Winter found that a platform such as Hadoop is significantly less expensive than a DW for some workloads. He used the example of "data refining," the process by which manageable data sets are produced from the deluge of information generated by machines, sensors, applications, services, and the like. This casts Hadoop in the familiar "landing zone" role -- i.e., as an ingestion point where data is landed and prepared for analysis. The cost economics of Hadoop are inversely related to those of the DW, Winter explained: Hadoop system costs tend to be drastically lower, while Hadoop development costs tend to be much higher. The upshot, Winter argued, is that for a data-refining or landing-zone use case with 500 TB of storage, the cost of storage in a DW is several times that of Hadoop, making Hadoop the significantly less expensive platform for these workloads.
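Under the same assumptions as the sketch above, the landing-zone comparison reduces to simple storage arithmetic; again, the data warehouse list price is a hypothetical stand-in rather than one of Winter's figures.

```python
# Landing-zone storage comparison at 500 TB, reusing the figures from
# the sketch above: Winter's $1,000/TB Hadoop price and 40 percent
# discount, plus a hypothetical $10,000/TB DW list price.
TERABYTES = 500
hadoop_storage = TERABYTES * 1_000               # $500,000
dw_storage = TERABYTES * 10_000 * (1 - 0.40)     # $3,000,000
print(f"DW storage runs {dw_storage / hadoop_storage:.0f}x Hadoop's")
```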
Winter's enterprise data warehouse (EDW) comparison found just the opposite. He used the example of an EDW in a large enterprise environment, with 25 FTEs producing 10 new distinct complex queries and one new distinct analytic per day. Annually, these FTEs produce 300,000 lines of new code for analytical applications. The data volume baseline of this system, too, is 500 TB.
In either case, the cost is staggering: Winter pegs the combined five-year cost of an MPP-powered DW at $265 million. That's high, to be sure, but it's a fraction of the cost of its Hadoop equivalent, which comes to 2.8 times as much -- approximately $740 million.
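The multiplication behind that Hadoop figure is easy to check against the numbers quoted above:

```python
# Verifying the EDW comparison: 2.8 times Winter's $265 million
# five-year data warehouse estimate.
dw_five_year_cost = 265_000_000            # Winter's MPP DW figure
hadoop_five_year_cost = dw_five_year_cost * 2.8
print(f"${hadoop_five_year_cost:,.0f}")    # $742,000,000 -- roughly
                                           # the $740 million cited
```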
Winter added a few minor caveats. First, he said, TCOD works across a range of volumes: "My assumption was 500 TB [for the EDW use case]; at 50 TB, the savings is actually increasing with the data warehouse over Hadoop," he explained. "You can go way up in volume and still see a similar dynamic."
In addition, Winter's TCOD estimates don't take into account workload modeling and capacity management, which he says could produce different numbers. For the decision support use case, TCOD also uses vanilla Hive in place of new projects -- such as Impala -- which graft an interactive query facility onto Hadoop. Even though a technology such as Impala is faster than Hive, it's still considerably slower (as a query platform) than an MPP-powered data warehouse.
There's also the fact that some MPP data warehouse platforms -- such as Teradata -- incorporate advanced tuning and workload management features. These take years to develop and are mostly missing from Hadoop: "Simple queries aren't really completely equivalent on Hadoop and the data warehouse; if you have to do a lot of them on Hadoop, you get into issues of concurrency, and if you have to do a lot of concurrent work with different ... objectives, you get into workload management."
Winter wrapped up his presentation with a pragmatic assessment. "When you're launching a new initiative, creating a major new workload, bringing a major new source of data into your environment, you want to make an informed decision, taking into account, along with other factors, the total cost you're likely to see over time," he said. "This framework gives you a way to do that. The examples show that total cost is very sensitive to the choice of technology, so it's dangerous to think that all of your requirements are going to play out the same way in terms of cost."