Evolving Data Warehouse Architectures: Integrating HDFS with an RDBMS Alleviates the Limitations of Both
Hadoop has limitations. But the relational database management systems used for data warehousing do, too. Luckily, their strengths are complementary.
By Philip Russom, TDWI Research Director for Data Management
In a recent blog in this series, I discussed “The Roles of Hadoop” in evolving data warehouse architectures. (There’s a link to that blog at the end of this blog.) In response, a few people asked me (I’m paraphrasing): “Since the Hadoop Distributed File System (HDFS) is so useful, can it replace the relational database management system (RDBMS) that’s at the base of my current data warehouse and its architecture?”
The short answer is: “No.” The long answer is: “Not today, and probably not in the future.” The main reason is that Hadoop—in its current form—lacks (or is weak with) many of the functions that we depend on in our RDBMSs. As you’ll see in the list below, most of the RDBMS functions I have in mind enable feature-rich and high-performance access to stored data via SQL. Other functions concerns tools for data security and administration.
Just so you know where this blog is going: Hadoop has limitations, but the average data warehouse does, too. Luckily, the strengths and weaknesses of the two are complementary (for the most part). When you integrate Hadoop and an RDBMS, they fill in each other’s holes and provide a more broadly capable data warehouse architecture than has been possible until now.
Hadoop’s Limitations Relative to RDBMSs Used for Data Warehousing
Despite all the goodness of Hadoop I described in a previous blog, there are areas within data warehouse architectures where HDFS isn’t such a good fit:
HDFS is a distributed file system and therefore lacks capabilities we expect from relational database management systems (RDBMSs), such as indexing, random access to data, support for standard SQL, and query optimization. But that’s okay, because HDFS does things RDBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality (though not fully relational), users can layer HBase over HDFS, as well as the query framework called Hive.
Low-latency data access and queries.
HDFS’s batch-oriented, serial-execution engine means that it’s not the best platform for real-time or speedy data access or queries. Furthermore, Hadoop lacks mature query optimization. Hence, the selective random access to data and iterative ad hoc queries that we take for granted with RDBMSs are alien to Hadoop.
An RDBMS integrated with Hadoop can provide needed query support. HBase is a possible solution, if all you need is a record store, not a full-blown DBMS. And upcoming improvements to Hadoop Hive and the new Impala query engine will address some of the latency issues.
HDFS and other Hadoop products can capture data from streaming sources (Web servers, sensors, machinery) and append it to files. But, being inherently batch, they are ill-equipped to process that data in real time. In my opinion at this date, such extremes of real-time analytics are best done with specialized tools for complex event processing (CEP) and/or operational intelligence (OI) from third-party vendors.
Hadoop today includes a few security features, such as file-permission checks, access control for job queues, and service-level authorization. Add-on products that provide encryption and LDAP integration are available for Hadoop from a few third-party vendors. Since HDFS is not a DBMS (and Hadoop data doesn’t necessarily come in relational structures), don’t expect granular security at the row or field level, as in an RDBMS.
According to a TDWI survey, security is Hadoop users’ most pressing need, followed by a need for better administrative tools, especially for cluster deployment and maintenance. The good news is that a few vendors offer tools for Hadoop administration, and an upgrade of open-source Ambari is coming.
With the above latency limitations in mind, HDFS is a problematic choice for workloads that are iterative and query based, as with SQL-based analytics. Furthermore, Hadoop products today have limited support for standard SQL. A number of vendor products (from RDBMSs to data integration and reporting tools) can provide SQL support today, and open-source Hadoop has new incubator projects that will eventually provide adequate support for SQL. These are critical if Hadoop is to become a productive part of a SQL-driven data warehouse architecture.
I’m not making up these limitations for Hadoop. The list is based on survey results and user interviews, as reported in my 2013 TDWI Best Practices Report: Integrating Hadoop into BI and Data Warehousing
In Defense of Hadoop
My list of limitations might seem like “Hadoop bashing” to some readers, but that is not what I intend. So let me restate what I stated positively in the last blog: “HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures.”
Sure, Hadoop’s help is limited to “some areas.” But the fantastically fortuitous fact is that most of Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, outrageously large data sets, non-SQL algorithmic analytics, and the flood of files that’s drowning many of us. Conversely, Hadoop’s limitations (as discussed above) are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus admin tools. In that light, I hope it’s clear that Hadoop and the average data warehouse are complementary (despite a bit of overlap), so it’s unlikely that one could replace the other, as I am often asked.
Integrating Hadoop with an RDBMS Alleviates the Limitations of Both
The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way, due to efforts from software vendors and the open source community. And technical users are starting to leverage HDFS/RDBMS integration.
For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration media) to RDBMSs (elsewhere in the DW architecture) that are more conducive to SQL-based analytics. HDFS serves as a massive data staging area. A similar best practice is to use an RDBMS as a front-end to HDFS data; this way, data is moved via queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store. For more information about these practices, replay my recent TDWI Webinar: "Ad Hoc Query Speed for Hadoop
Other Blogs in the Evolving Data Warehouse Architectures series:
• From EDW to DWE
• The Role(s) of Hadoop
Posted by Philip Russom, Ph.D. on September 2, 2013