
Hadoop Usage Poised to Explode

Today, comparatively few enterprises are using HDFS, the distributed storage substrate of the Hadoop framework. That's quickly changing.

Only about 10 percent of organizations are using the Hadoop Distributed File System (HDFS) in production today, according to survey data from TDWI Research. That said, a staggering proportion of organizations expect to be using Hadoop: all told, nearly three-quarters (73 percent) of respondents have either deployed (10 percent) or expect to deploy (63 percent) HDFS in production.

Slightly more than a quarter (27 percent) of respondents say they don't have any HDFS deployment plans.

These are some of the more intriguing take-aways from Integrating Hadoop into Business Intelligence and Data Warehousing, a new report authored by Philip Russom, research director for data management with TDWI Research.

According to Russom, "Many business intelligence (BI) and data warehousing (DW) professionals are looking at HDFS, MapReduce, and other Hadoop technologies as ways to cost-effectively extend their existing BI/DW infrastructure. For example, many DW environments need a bigger and better data staging area, which HDFS can enable. Many BI programs need to embrace a broader range of analytic techniques, which MapReduce can do. Furthermore, very few BI and DW solutions as yet do anything serious with unstructured data, which a number of products in the Hadoop family can assist with."

The TDWI survey, based on a sample of 263 respondents, suggests that Hadoop adoption could ramp up very quickly: for example, more than one-quarter (28 percent) of respondents expect to be managing production deployments of HDFS in the next 12 months. Others expect their Hadoop deployments to come online more gradually: 24 months (13 percent), 36 months (10 percent), or more than three years (12 percent).

Not surprisingly, HDFS and MapReduce, the parallel processing engine that complements HDFS' distributed storage, are today the two most-used Hadoop technologies: just over two-thirds (67 percent) of Hadoop adopters use HDFS, and a slightly bigger share (69 percent) uses MapReduce. That MapReduce usage outstrips HDFS usage isn't surprising either, Russom explains, because some Hadoop vendors (e.g., MapR) replace HDFS with proprietary file systems, so their customers can run MapReduce without HDFS. MapReduce itself has been implemented in many contexts: for example, two prominent analytic database platforms (Teradata Corp.'s Aster Discovery and EMC Corp.'s Greenplum) have supported in-database MapReduce, across their own MPP clusters, for almost five years.

"The high MapReduce usage also explains why Java and R ranked fairly high in the survey; these programming languages are not Hadoop technologies per se, but are regularly used for the hand-coded logic that MapReduce executes," Russom writes.

"Likewise, Pig ranked high in the survey as a tool that enables developers to design logic -- for MapReduce execution -- without having to hand code it."

Outside of these, Russom and TDWI found that certain Hadoop technologies tend to be more popular than others -- at least among TDWI's core audience of BI and DW practitioners. For example, half of respondents plan to adopt Mahout (an open source machine learning library for Hadoop) within the next three years; 44 percent say the same about R, a programming and execution environment for statistical computing.

Likewise, 42 percent plan to adopt ZooKeeper, a fault-tolerant synchronization facility for distributed applications, in the same three-year window.
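What "synchronization facility" means in practice is that distributed processes coordinate through small data nodes (znodes) in ZooKeeper's replicated namespace. Below is a minimal sketch, using ZooKeeper's Java client, of a worker registering itself as an ephemeral znode so that peers can track live membership; the ensemble address, paths, and payload are hypothetical, and the persistent parent node /workers is assumed to already exist.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Minimal sketch: register this process as an ephemeral znode.
    // Ensemble address, znode paths, and payload are hypothetical;
    // the persistent parent /workers is assumed to exist already.
    public class WorkerRegistration {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper(
                    "zk1.example.com:2181", 15000,
                    event -> { /* ignore session events in this sketch */ });

            // Ephemeral nodes are deleted automatically when the client's
            // session ends, so watchers of /workers see failures promptly.
            zk.create("/workers/worker-1",
                    "10.0.0.5:9000".getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);

            Thread.sleep(Long.MAX_VALUE); // keep the session (and znode) alive
        }
    }

The ephemeral-node mechanism is what makes the facility fault tolerant: a crashed worker's registration vanishes on its own, with no cleanup code required.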

HCatalog Adoption Lags

Only 40 percent of respondents said they plan to use HCatalog, a table and metadata catalog for Hadoop. That's a high percentage in absolute terms, but it's surprisingly low given that many BI and DW tools rely on HCatalog to get structured information out of Hadoop.
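For a sense of what that HCatalog-mediated access looks like, here is a minimal sketch of a MapReduce job reading a Hive/HCatalog-managed table through HCatInputFormat rather than raw file paths; the database and table names are hypothetical, the mapper/reducer wiring is elided, and the package path shown is the one used by Hive-bundled HCatalog releases.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    // Minimal sketch: point a MapReduce job at an HCatalog-managed table.
    // Database and table names are hypothetical. HCatalog supplies the
    // schema, location, and storage format, so the job never hard-codes
    // file paths or parsing logic.
    public class ReadViaHCatalog {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read via hcatalog");
            job.setJarByClass(ReadViaHCatalog.class);

            HCatInputFormat.setInput(job, "default", "web_logs");
            job.setInputFormatClass(HCatInputFormat.class);

            // ... set mapper, reducer, and output here; map input values
            // arrive as HCatRecord objects keyed by the table's schema.
        }
    }

The alternative that Glick describes below, a customer-maintained schema definition over raw files, puts that same knowledge in application code instead of a shared catalog.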

"We do have support for Hcatalog," says Rick Glick, vice president of technology and architecture with ParAccel Inc., who says that Hcatalog is the primary programmatic means by which ParAccel gets information out of Hadoop.

That said, HCatalog still isn't commonly used, he concedes. "[HCatalog is] more [common] than what else is out there, [although] there's also the Hive catalog," he continues. "Most users tend to build something themselves to let them know [what they're storing in Hadoop]. Everybody throws data in there with an eye to using it somehow, or simply [as a means] to archive it with a way [i.e., a customer-specific schema] to get it out. Yes, sometimes people use HCatalog, but it's actually not commonly used." In most cases, Glick says, customers use a "brief schema definition of the files" in Hadoop in place of HCatalog.

If HCatalog's lagging adoption is a puzzle, that of other Hadoop technologies isn't.

For example, comparatively few BI or DW professionals expect to adopt Chukwa (4 percent) or Ambari (6 percent). The former focuses on large-scale log collection and analysis; the latter is a still-incubating Hadoop management project. Neither is explicitly a data management (DM)-oriented project. Over time, Russom expects, some laggards (e.g., HCatalog and Ambari) will see increased adoption.

"BI professionals are accustomed to DBMSs, and so they long for a Hadoop-wide metadata store and far better tools for HDFS administration and monitoring," he writes. "These user needs are being addressed by HCatalog and Ambari, respectively, and therefore TDWI expects both to become more popular."

Russom's 36-page report addresses many aspects of Hadoop adoption and deployment. You can download it at no cost from TDWI's website.
