New Report Helps Users Get Up to Speed on Hadoop
A new report from TDWI helps users understand the basics of Hadoop and MapReduce.
- By Stephen Swoyer
- February 28, 2012
There's been a lot of hoopla over Hadoop, the open source software (OSS) big data platform available from the Apache Software Foundation. Much of it is deserved, inasmuch as Hadoop (and related technologies, such as MapReduce) can radically accelerate some workloads, especially those for big data and analytics.
A recent report from TDWI aims to help would-be Hadoop users get a handle on the fundamentals of the technology, with the goal of understanding Hadoop's value for BI, data warehousing, and advanced analytics.
To do the technology justice, writes industry veteran Philip Russom, research director for data management with TDWI, would-be Hadoop users must first get a sense of when and where Hadoop fits into BI and analytics programs and -- perhaps just as important -- when and where it doesn't.
"Hadoop is excellent for storing and searching multi-structured big data, but advanced analytics is possible only with certain combinations of Hadoop products, third-party products, or extensions of Hadoop technologies," writes Russom in Hadoop: Revealing Its True Value for Business Intelligence, the latest in TDWI's series of Checklist Report publications.
Even now, business intelligence (BI) and data warehousing (DW) pros are still getting up to speed on Hadoop, Russom observes. Many are confused by a Hadoop stack that includes both a core offering (comprising both MapReduce and the Hadoop Distributed File System, or HDFS) and an ecosystem of related or complementary projects. Many users are simply perplexed by the hype and hubbub that have surrounded Hadoop almost from the outset.
In spite of this, Russom believes that "The business advantages of big data analytics are the leading reasons why BI/DW professionals need to know more about Hadoop now." Despite the hype and confusion, he argues, "Hadoop techniques will soon become a common complement to older BI/DW approaches."
The first reality check, says Russom, is that would-be adopters need to understand that Hadoop isn't a single technology, per se. According to Russom, Hadoop is an ecosystem comprising core Hadoop itself, related technologies (such as Hive, HBase, or Pig), and an ever-growing array of third-party offerings, including vendor-specific distributions of core Hadoop (i.e., MapReduce and HDFS).
Speaking of core Hadoop and its bread-and-butter file system, BI and DW pros should bear in mind that HDFS is not a database -- although in some respects it behaves like one. This has attendant advantages and disadvantages, Russom writes. "HDFS can query and index the data it manages, which makes it similar to a DBMS, but that doesn't make it a true DBMS," he points out.
"As a file system, HDFS manages files that contain data. Because it is file based, HDFS itself does not offer random access to data and has limited metadata capabilities when compared to a DBMS. Likewise, HDFS is strongly batch oriented, so it has limited real-time data access functions," Russom continues, noting that there are a few ways to work around this, e.g., by tapping Hive (which supports DBMS-like metadata capabilities) or by layering HBase over HDFS (which likewise confers DBMS-esque advantages).
"HBase is new and will no doubt improve, but today it's limited to straightforward tables and records with little support for more complex data structures," Russom stresses.
On the other hand, because HDFS isn't a DBMS, it's much better at handling unstructured data. Furthermore, because HDFS is a replication-based distributed file system, Russom notes, it boasts a built-in degree of fault tolerance.
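For a more concrete picture of what "file based" means in practice, the minimal sketch below uses Hadoop's Java FileSystem API; the paths and record contents are made up for illustration. Data lands in HDFS as whole files that are streamed in and scanned back out; there is no row-level query.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileAccess {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's default file system, as configured for the Hadoop client
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Writing means streaming bytes into a file, not inserting rows into a table
        Path path = new Path("/data/clickstream/part-0001.log"); // illustrative path
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("2012-02-28T00:00:01\tuser42\t/index.html\n");
        out.close();

        // Reading means scanning the file from the top; there is no WHERE clause
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}
```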
If ever a myth begged to be busted, it's the notion of MapReduce as an all-powerful analytics technology. By itself, Russom stresses, MapReduce isn't even a strictly analytic offering.
"MapReduce is more of a general-purpose execution engine that works with a variety of storage technologies, including HDFS, other file systems, and some DBMSs," he points out. "As an execution engine, MapReduce and its underlying data platform handle the complexities of network communication, parallel programming, and fault tolerance. In addition, MapReduce controls hand-coded programs and automatically provides multithreading processes so they can execute in parallel for massive scalability. The controlled parallelization of MapReduce can apply to multiple types of distributed applications, not just analytic ones."
When it's applied to advanced analytics, Russom says, MapReduce inverts the existing BI and DW model, which brings the data to the tool; Hadoop, by contrast, brings the processing (MapReduce) to the data. This -- along with the fact that it was designed for parallel processing -- is why Hadoop is such a slam-dunk technology for big data and advanced analytics.
"This is the reverse of older practices where we bring large quantities of transformed data to an analytic tool, especially those based on data mining or statistical analysis. As big data gets bigger, it's just not practical -- from both a time and cost perspective -- to move and process that much data," Russom writes.
Because MapReduce depends so extensively on hand coding, it flies in the face of a recent movement -- at least on the part of BI and DW pros -- away from hand-coded or manual approaches and toward vendor-developed tools.
Although MapReduce can support a dizzying array of programming or query languages, it doesn't -- by itself, anyway -- support SQL.
"Luckily, Hadoop Hive gets results that are similar to those of SQL, and Hive has a syntax that is similar to that of SQL," Russom explains. "Hence, BI/DW professionals who know SQL can learn Hive easily. Furthermore, as more vendors release ODBC and JDBC drivers, these enable BI professionals to develop in SQL while the driver handles translations to Hive and back."
This just scratches the surface of Russom's report, which addresses other Hadoop-related strengths -- such as data diversity -- and limitations. You can download the complete report from TDWI at no cost.