Taking the Sting Out of Hadoop's Growing Pains

Hadoop is in the midst of a significant growth spurt.

The thing about growth spurts is that they inevitably produce growing pains -- which the Hadoop community has worked feverishly to address. These efforts are bearing demonstrable fruit.

Take the long-awaited Yet Another Resource Negotiator (YARN) project, which officially debuted with version 2.2 of the Hadoop framework last October. Prior to YARN, Hadoop used a pair of daemons -- JobTracker and TaskTracker -- to handle resource negotiation. Both were conceived and designed with Hadoop's MapReduce compute engine in mind: they effectively presuppose MapReduce. In this regard, it's by no means a stretch to describe YARN as analogous to the "permanent" teeth that replace the "baby" (or "primary") teeth that develop during human infancy.

In a sense, Hadoop's MapReduce-centrism was essential to its early growth and development. Hadoop MapReduce exposes an open, reasonably straightforward parallel programming model. If it doesn't quite democratize parallel programming -- coding for MapReduce requires Java programming chops, expertise in other procedural languages, or knowledge of Pig Latin (coding for data management-specific MapReduce jobs, such as ETL processing, requires additional specialization) -- it certainly lowers the bar.
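To give a feel for why the model is considered approachable, here is a minimal, single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made for illustration, and a real cluster would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three steps in sequence (hypothetical driver, one process)."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Illustrative input only; not data from the article.
counts = run_job(["Hadoop runs MapReduce", "YARN runs more than MapReduce"])
```

The programmer writes only the mapper and the reducer; grouping by key and the parallelism are left to the framework, which is roughly where the lowered bar comes from.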

The thing is, MapReduce is a brute-force data processing tool: it might be ideal for certain kinds of workloads, but it's less ideal for others. As the Hadoop platform pushes deeper into the enterprise core -- e.g., from primary use in test-bed, development, or skunk-work scenarios to use in production environments -- its MapReduce-centrism becomes problematic: IT organizations and ISVs will increasingly want to run optimized workloads on their Hadoop clusters. Prior to YARN, it was possible to run non-MapReduce workloads in a Hadoop cluster, but it wasn't possible to use Hadoop's vanilla JobTracker and TaskTracker to manage them. Now that YARN is available, users should finally be able to manage, monitor, and scale mixed workloads in the Hadoop environment.

That's big, says Webster Mudge, senior director of technology solutions with Cloudera Inc. -- but it isn't as big as you might think. "We are really happy that YARN is now [generally available]. We've been running YARN for over a year now [internally], and it is the default resource management container for Cloudera Hadoop 5 and thus [for] Cloudera Enterprise. However, it's not the only answer to resource management," says Mudge, who argues that for some workloads -- for example, long-lived, kernel-level applications, and extremely short-lived applications -- YARN is insufficient or sub-optimal.

"YARN is a generalized container for resource management within Hadoop. If you think about how MapReduce the batch programming language was very tightly coupled with [the first versions of Hadoop], the earlier versions expected that you were going to be running MapReduce, so that's how [the Hadoop platform] did its resource management and the like. YARN is a big improvement, but you shouldn't believe the hype that it's what you need [alone] for multi-tenancy. It's one of the things you need, but YARN [by itself] doesn't give you security, governance, and management."

These disciplines -- along with failover/disaster recovery -- arguably account for Hadoop's biggest or most intractable growing pains. At this point, for example, "integrated" Hadoop management is -- by data management standards -- primitive. A presentation on precisely this topic at last year's Strata+Hadoop World conference focused on a single Hadoop distribution (Cloudera Enterprise) and a single GUI-based management tool (Cloudera Manager). It nonetheless made extensive use of a command-line interface (CLI) and CLI-based scripts.

What's more, until recently, Hadoop lacked important security features such as native role-based access control (RBAC) or support for volume-level encryption, which are checklist items for most large enterprise customers.

Last July, however, Cloudera kicked off "Sentry," an Apache-licensed OSS project for Hadoop.

At a minimum, Sentry aims to provide role-based authorization capabilities for Hadoop services such as Hive (a SQL-like interpreter for Hadoop that compiles MapReduce jobs) and Impala (an interactive SQL query facility for Hadoop).

Cloudera has an even bigger vision, however. "We're starting to see [Hadoop] as a focal point for this granular control for all data sets, data types, data engines within Hadoop," says Mudge. "This means starting with the most common, the SQL-based [data types or data engines] -- so Impala and Hive will share the same privilege model because they share the same metadata model."
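The idea of two engines sharing one privilege model can be sketched in a few lines. What follows is a generic role-based access control illustration, not Sentry's actual API or policy format: the `PolicyStore` class, the group, role, and table names, and the two `*_query_allowed` helpers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    action: str  # e.g. "SELECT" or "INSERT"
    table: str   # e.g. "sales.orders"


@dataclass
class PolicyStore:
    """Hypothetical central store: groups map to roles, roles to privileges."""
    role_privileges: dict = field(default_factory=dict)
    group_roles: dict = field(default_factory=dict)

    def grant(self, role, action, table):
        self.role_privileges.setdefault(role, set()).add(Privilege(action, table))

    def assign(self, group, role):
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups, action, table):
        """True if any role held by any of the user's groups grants the privilege."""
        wanted = Privilege(action, table)
        return any(
            wanted in self.role_privileges.get(role, set())
            for group in groups
            for role in self.group_roles.get(group, set())
        )


policy = PolicyStore()
policy.grant("analyst", "SELECT", "sales.orders")
policy.assign("bi_team", "analyst")


# Both (hypothetical) engine hooks consult the same store, so a grant made
# once applies to queries arriving through either engine.
def hive_query_allowed(groups, table):
    return policy.is_allowed(groups, "SELECT", table)


def impala_query_allowed(groups, table):
    return policy.is_allowed(groups, "SELECT", table)
```

Because both hooks read the same table-level metadata and the same grants, there is no second policy to keep in sync, which is the practical benefit Mudge describes.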

Beyond this, Cloudera casts Sentry as a central security authority -- not just for Hadoop (which Cloudera likewise locates at the enterprise core -- i.e., as an enterprise data hub), but for apps or services of every kind, which will be able to hook into it and use it as a provider. Mudge cites Apache Solr -- an open source search and content management facility -- as one such example.

Sentry is still gestating, however. Currently, Cloudera's commercial distribution of Hadoop (Cloudera Enterprise) relies on products from partners to deliver advanced security features, such as data masking, tokenization, and volume-level encryption. "What you're seeing is that within the data protection layer of Hadoop, Hadoop itself doesn't necessarily provide those capabilities out of the box, but it relies on the insertion points in the substrate for partners," says Mudge.

One Hadoop-focused vendor that isn't a Cloudera partner is Zettaset Inc. It nonetheless partners with a veritable Who's Who of DM players, including Actian Corp., IBM Corp., Informatica Corp., MicroStrategy Corp., and Teradata Corp. Zettaset's specialty? Nothing less than "secure big data management," says president and CEO Jim Vogt.

"What we've built is enterprise software [Zettaset Orchestrator] that rides up on top of open source software and hardens it for the enterprise. Our focus is on ease of management, scale, performance, and security," explains Vogt, who argues that "the big thing holding up [production] deployments [of Hadoop] is security. [Adopters] have [security] mandates or requirements they just can't meet" using free or commercial distributions of Hadoop.

Vogt explicitly contrasts Zettaset's approach -- "We file patents, we don't just donate everything back to the community" -- with those of players such as Cloudera and Hortonworks Inc. "We support [volume-level] encryption," he explains, "and we have some patents that we've filed around FLASH-aware and SAN-aware [encryption] so that you can optimize based on that."

Zettaset is extending its RBAC facility to interoperate with the BI and DBMS offerings of its partners. "We have role-based access control that is very granular. We have an API for RBAC, and we opened up our security framework across [Hadoop] distributions and across applications or databases. We can integrate with MicroStrategy, with Teradata, with Hortonworks," says Vogt.

"One key partner is Informatica, and they allow us to do ETL and data transfer and also some basic visualization. What they like about [partnering with] us is, they were certifying [Hadoop] distribution by [Hadoop] distribution, but if they use Orchestrator, they can certify for just us."
