Executive Summary: Integrating Hadoop into Business Intelligence and Data Warehousing
Hadoop promises to assist with the toughest challenges in BI today, including big data, advanced analytics, and multi-structured data. Download this TDWI Best Practices Report to learn how to integrate Hadoop into your business intelligence, analytics, data integration, and data warehousing technology stacks.
- By Philip Russom, Ph.D.
- April 1, 2013
Apache Hadoop is an open source software project administered by the Apache Software Foundation
(ASF). The Hadoop family of products includes the Hadoop Distributed File System (HDFS),
MapReduce, Pig, Hive, HBase, and so on. These products are available as open source from ASF, as
well as from several software vendors. The number of vendor products that integrate with Hadoop
products increases almost daily. In this report, the term “Hadoop” usually means the entire Hadoop
family of products, regardless of their open source or vendor origins. Some discussions focus
specifically on HDFS.
Business intelligence (BI) professionals’ interest in Hadoop has grown in recent years
because Hadoop has proved useful for the toughest challenges in BI today, namely big data,
advanced analytics, and multi-structured data. For that reason, TDWI anticipates that Hadoop
technologies will soon become a common complement to (but not a replacement for) established
products and practices for BI, data warehousing (DW), data integration (DI),
and analytics. Therefore, a wide range of user organizations need to prepare for Hadoop usage.
Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to
get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and
analytics technology stacks.
According to this report’s survey, users with hands-on Hadoop experience say it’s still immature and
needs serious improvements in security, administrative tools, high availability, and real-time
operation. These and other problems are being addressed by the open source community of technical
users, which continues to infuse innovation into existing Hadoop products as well as introduce new
ones via ASF’s incubation process. The pace of Hadoop innovation has accelerated because a number
of software vendors now contribute to Hadoop’s open source code. The first wave of support for
Hadoop technologies by vendor tools and platforms is already in place, with subsequent waves
coming soon. The number of technical users conversant in Hadoop is increasing steadily.
According to this report’s survey, the Hadoop products most commonly used today are (in descending
order) MapReduce, HDFS, Java, Hive, HBase, and Pig. Those poised for greatest future adoption are
Mahout, Zookeeper, and HCatalog. All of these have compelling use cases for BI, DW, DI, and
analytics. In fact, survey respondents who have hands-on Hadoop experience say they’ve already
integrated Hadoop with analytic tools, DWs, reporting tools, Web servers, analytic databases, and
data visualization tools—showing that Hadoop is already established as a component within BI/DW
technology stacks. Of these respondents, 78% feel Hadoop is a complement to a DW, not a
replacement. Enabling big data analytics is the leading benefit of Hadoop, whereas a lack of Hadoop
skills is the leading barrier. BI/DW aside, a few respondents also anticipate using Hadoop as a live
archive (23%) or as a platform for content management (35%).
Only 10% of organizations surveyed have a Hadoop implementation in production today, but a
whopping 51% say they’ll have one within three years. If this trend pans out, Hadoop will impact at
least half of BI/DW environments soon. Hence, users need to prepare for Hadoop usage now.
The purpose of this report is to accelerate users’ understanding of the many new Hadoop-based
products that have emerged in recent years. The report also maps newly available Hadoop options
to real-world use cases. This information can help user organizations successfully integrate Hadoop
technologies into their BI portfolios and practices with maximum business value.
Cloudera, EMC Greenplum, Hortonworks, ParAccel, Pentaho, SAP, SAS, Tableau Software, and Teradata
sponsored the research for this report.
Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 600 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email (email@example.com), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).