Can Hadoop Replace a Data Warehouse?
It depends on what you think a data warehouse is and what your organization is trying to do with it.
- By Philip Russom, Ph.D.
- January 27, 2015
Recently, I interviewed about 20 users of different types for the upcoming TDWI Best Practices Report on Hadoop for the enterprise. The users I spoke with ranged from seasoned data warehouse professionals to application developers with limited data experience. Given the diversity of users (who come from diverse organizations with diverse requirements), I got diverse ideas about what a warehouse is (and is not), plus whether or not Hadoop can replace a data warehouse -- whatever that is.
The upcoming report will include several user stories based on these interviews. Allow me to share a few of these now, including quotes from the participants -- in particular, the stories that address the definition of a data warehouse as well as how that definition influences whether Hadoop can replace a data warehouse.
Data professionals tend to see Hadoop as an extension of the data warehouse architecture or general environment, sometimes with an eye toward economics, not technology. As one person explained: "At this point, I personally don't believe Hadoop can replace a relational database management system, much less a relational data warehouse. [However,] I do believe we can reduce our footprint on expensive relational databases by migrating some data to Hadoop. That would make our data warehouse platform more affordable and free up capacity for growth, which in turn makes it look more valuable from an economic perspective."
Some organizations have limited experience with open source software (OSS) -- beyond Linux, which is everywhere. Others have a technical culture that eschews hand coding (which many OSS products require), so adopting Hadoop is a leap of faith. "We don't have a history of using open source, so IT is a bit uncomfortable with Hadoop. Our focus on vendor distributions of Hadoop in our ongoing 'proof of concept' evaluations helps IT feel more confident that they'll get the support, security, consulting, and administrative tools they need for the effective use of Hadoop. These aren't available from the open source community, which is why we're so sure that a vendor distribution is the way to go."
Before I'm convinced that a data warehouse is really a warehouse, I look for certain data structures, namely multidimensional data, time series, aggregates, and lots of calculated values that don't exist in source systems. In my experience, many so-called warehouses are better described as marts, operational data stores, archives, or simply generic databases. I freely admit that making distinctions among these data models is akin to hair splitting.
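To make those structures concrete, here is a minimal sketch in Python. It turns flat source-system rows into a small multidimensional aggregate, adds a calculated value (margin) that no source row carries, and rolls the result up into a time series. The sample rows, field names, and the functions `build_monthly_cube` and `monthly_series` are all hypothetical, invented for illustration; they come from neither the interviews nor any particular product.

```python
from collections import defaultdict

# Hypothetical source-system rows: (date, region, product, revenue, cost).
source_rows = [
    ("2015-01-05", "East", "Auto", 1200.0, 900.0),
    ("2015-01-19", "East", "Home", 800.0, 500.0),
    ("2015-01-22", "West", "Auto", 1000.0, 700.0),
    ("2015-02-03", "East", "Auto", 1500.0, 1000.0),
]


def build_monthly_cube(rows):
    """Aggregate rows by (month, region, product) and add a calculated margin."""
    totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})
    for date, region, product, revenue, cost in rows:
        month = date[:7]  # time dimension at month grain, e.g. "2015-01"
        cell = totals[(month, region, product)]
        cell["revenue"] += revenue
        cell["cost"] += cost

    cube = {}
    for key, cell in totals.items():
        # Margin is a calculated value: it exists in no source row.
        if cell["revenue"]:
            margin = round((cell["revenue"] - cell["cost"]) / cell["revenue"], 4)
        else:
            margin = None
        cube[key] = {"revenue": cell["revenue"], "cost": cell["cost"], "margin": margin}
    return cube


def monthly_series(cube, measure):
    """Roll the cube up across region and product into a per-month time series."""
    series = defaultdict(float)
    for (month, _region, _product), cell in cube.items():
        series[month] += cell[measure]
    return dict(sorted(series.items()))


cube = build_monthly_cube(source_rows)
revenue_by_month = monthly_series(cube, "revenue")
```

A store that only ever holds rows shaped like `source_rows` is closer to an archive or operational data store; by the test above, structures like `cube` and `revenue_by_month` are what suggest a warehouse.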
I'm not the only one who thinks this way. As one participant pointed out, "Our bespoke data warehouse isn't really a data warehouse. It's just an archive of relational data from a short list of packaged applications, along with lots of logs from miscellaneous applications. This is all we need, and we have no plans for a more sophisticated warehouse. All of the warehouse data has rather simple data models, and our tests have shown that the data is easily managed and queried in Hadoop, using mostly Hive and HBase, with a little Pig and MapReduce. So, we're planning to migrate the warehouse to Hadoop, to get it off of the highly expensive relational database it's on today."
Almost no one interviewed even mentioned reporting. The consensus is that Hadoop is a platform for advanced analytics, not the reporting, OLAP, and performance management that most data warehouses were built for. Therefore, Hadoop is a complement, not a replacement, as seen in the view of a data warehouse architect from the insurance industry. "Our enterprise data architecture group just did a study of Hadoop. ... The study determined that our first use cases for Hadoop should be extending existing risk analytics, improving the bottom line, mining social data, and bringing in more data from third parties. Other potential use cases will be fraud detection, mining data about annuities and indices, and cross-selling and up-selling."
Yet, Hadoop is not just for analytics. A growing number of users deploy Hadoop as a modernization strategy for their data archives. "I personally have in-house experience with data archiving and record management applications, which is our primary use of Hadoop. For us, archives are fully modern in that they are online and easily accessed in near-real time, unlike the offline backups of the past. Likewise, we follow modern practices in archiving in that we prepare data before committing it to the archive on Hadoop so the data is easier to access and better suited to complex search and query. Plus, we keep data lineage information so the archived data is fully documented and trustworthy in audits, investigations, and legal activities."
Note that this firm's "modern" data archive has characteristics we can associate with data warehouses, such as data preparation before loading and indexing the data for easier retrieval. We're currently seeing a convergence as warehouses and archives evolve to resemble each other more closely.
I'd like to end with a quote that summarizes many of the issues discussed here. "What is a data warehouse? Our view is that the data is the warehouse, and our data just happens to be managed with a relational database today. Our data could be managed on a non-relational platform, and it would still be a warehouse. ... [T]he idea that Hadoop would replace a warehouse is misguided [because] the data and its platform are two non-equivalent layers of the data warehouse architecture. It's more to the point to conjecture that Hadoop might replace an equivalent data platform, such as a relational database management system."
Even so, few users are even contemplating a warehouse replacement. Instead, many are actively migrating some of their warehouse (defined as data) to other platforms, including Hadoop, as well as data warehouse appliances, columnar databases, NoSQL databases, clouds, and event-processing tools. They do this to get platforms better suited to advanced analytics with the migrated data (and other specialized workloads). In fact, this movement toward multi-platform data warehouse environments is one of the strongest trends in data architecture today. For these users, the multi-platform environment is the warehouse, not just the relational warehouse platform. The relational warehouse platform continues its life cycle, but only with the data that absolutely requires the mature relational functionality of that platform.
All this may change. Hadoop adds more relational functionality almost daily, so it's possible that even die-hard relational users will eventually migrate inherently relational data and processing to Hadoop. As the last user quoted pointed out, the data is the warehouse, so even if the warehouse's data migrates to Hadoop (or several types of data platform, as is the trend), the warehouse lives on.
If you're interested in reading the full TDWI Best Practices Report, Hadoop for the Enterprise, mark your calendar for April 1, 2015. That's when TDWI will publish the report as a PDF file on our website, and anyone can download the report for free after that date.