Hadoop and the Extended Data Warehouse Environment

A recent report from TDWI Research makes the case for a new kind of EDW -- the "extended data warehouse" environment. Think of it as the EDW-e.

EDW-e is shorthand for a data warehouse and decision support architecture that includes Hadoop. Think of it as Hadoop reconciled to the DW: that is, Hadoop used in the use cases or applications for which it's best suited. This is the problem that TDWI's Philip Russom tackles in his new report, Where Hadoop Fits in Your Data Warehouse Architecture. The idea is twofold: first, to extend the traditional DW -- chiefly by making what Russom describes as a few "tweaks" -- and, second, to leverage Hadoop's strengths. Russom, research director for data management with TDWI Research, has been using the term "extended data warehouse environment" for some time.

In a March interview, for example, he discussed what he called the "EDW-e environment," which isn't "one DW, it's actually lots of different platforms from many different vendors." At the time, Russom offered a succinct description of the problem. "We have a lot more data processing workloads today than ever before. We have newer workloads around real-time, analytics, and unstructured data that the data warehouse was not designed for," Russom told BI This Week, "but that's okay because you can have secondary platforms within the extended data warehouse environment that are well suited to those workloads."

In Russom's terms, Hadoop is one such secondary platform. But where does Hadoop fit into the DW-driven business intelligence (BI) environment, and what changes must data management (DM) practitioners make to accommodate it? Both questions have surprisingly straightforward answers, according to Russom.

"There are multiple areas in data warehouse architectures where Hadoop products can contribute," he writes, while noting that key aspects of Hadoop's design -- its high latency, its batch-centric orientation, and its file system underpinnings -- make it a poor choice for query-centric decision support workloads. For example, Hadoop's query facility is Hive, which compiles queries written in Hive Query Language (HQL), a SQL-like language that is not SQL, into MapReduce jobs. "At the moment, Hadoop seems most compelling as a data platform for capturing and storing big data within an extended DW environment, plus processing that data for analytic purposes on other platforms."
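
To make the Hive point concrete, here is a minimal sketch in plain Python (not Hive or Hadoop code) of how a SQL-like aggregate might decompose into map, shuffle, and reduce phases. The query, the sample rows, and every function name are hypothetical and chosen only for illustration; the plans Hive actually generates are considerably more involved.

```python
from collections import defaultdict

# Hypothetical input rows. They stand in for a table that a query such as
#   SELECT region, SUM(amount) FROM sales GROUP BY region
# might read. None of this data comes from the TDWI report.
SALES = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]


def map_phase(rows):
    """Emit one (key, value) pair per row: the GROUP BY column and the summed column."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Group values by key, the step a MapReduce framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate (SUM) to each key's list of values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases, mimicking one batch job for the whole query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(SALES))
```

Each phase here is a full batch pass over the data, which is consistent with the latency and batch orientation Russom describes.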

This is the so-called "Hadoop landing zone," the application for which a number of organizations use Hadoop today, Russom indicates: a platform for landing data, processing it (transforming and conforming it for ingestion by a data warehouse), and staging it. In this scheme, data is processed and prepared on Hadoop, then moved to an RDBMS. Because of its primitive DM feature set, Russom notes, Hadoop isn't ideal as a storehouse for frequently accessed business information -- e.g., for the kind of structured data that could be better stored (albeit at greater expense) in an RDBMS.
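
The land, process, stage, and move sequence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything taken from the report: the pipe-delimited record layout, the `sales` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the target RDBMS.

```python
import sqlite3

# Hypothetical raw records as they might arrive in a landing zone.
# The layout (date|region|amount) is an assumption made for this sketch.
RAW_EVENTS = [
    "2013-04-01|East |100.50",
    "2013-04-01|west|not_a_number",  # malformed amount; rejected below
    "2013-04-02| WEST|75",
]


def conform(line):
    """Parse one raw line into (day, region, amount), or return None if malformed."""
    parts = line.split("|")
    if len(parts) != 3:
        return None
    day, region, amount = (part.strip() for part in parts)
    try:
        return day, region.lower(), float(amount)
    except ValueError:
        return None


def stage(raw_lines):
    """Transform and conform the landed lines, keeping only the rows that parse."""
    rows = (conform(line) for line in raw_lines)
    return [row for row in rows if row is not None]


def load(rows, conn):
    """Move the staged rows into the relational store (SQLite as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (day TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(stage(RAW_EVENTS), conn)
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    print(conn.execute(query).fetchall())
```

The split mirrors the division of labor in the article: the cleanup work happens before the data reaches the relational store, and the query-centric work happens after.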

For one thing, Hadoop is primarily queried via Hive and HQL -- although proprietary, vendor-specific Hadoop SQL alternatives do exist -- and so isn't ANSI SQL compliant. It likewise isn't ACID-compliant. On the other hand, Hadoop can function as an adequate -- and, most important, cost-effective -- data archive.

"Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape, or delete it. Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment," Russom writes. Hadoop's file-based underpinnings also make it a good store for non-traditional data types, including what Russom calls "evolving schema" (e.g., A/B testing and multivariate tests) and "no schema" (e.g., audio and video files) content.

Russom's report addresses a number of other issues, including changing trends in data warehouse architectures; the practical use of Hadoop as a staging area, particularly from a data integration perspective; data archiving and multi-structured data; and, of course, advanced analytics, for which Hadoop is often touted as a platform.

You can download the report here. (A short registration is required for readers downloading TDWI content for the first time.)
