TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

RESEARCH & RESOURCES

Taking the Sting Out of Hadoop's Growing Pains

Hadoop's growth spurt has produced growing pains -- which the Hadoop community has worked feverishly to address. These efforts are bearing demonstrable fruit.

By Stephen Swoyer
February 4, 2014

Hadoop is in the midst of a significant growth spurt.

The thing about growth spurts is that they inevitably produce growing pains -- which the Hadoop community has worked feverishly to address. These efforts are bearing demonstrable fruit.

Take the long-awaited Yet Another Resource Negotiator (YARN) project, which officially debuted with version 2.2 of the Hadoop framework last October. Prior to YARN, Hadoop used a pair of daemons -- JobTracker and TaskTracker -- to handle resource negotiation. Both were conceived and designed with Hadoop's MapReduce compute engine in mind: they effectively presuppose MapReduce. In this regard, it's by no means a stretch to describe YARN as analogous to the "permanent" teeth that replace the "baby" (or "primary") teeth that develop during human infancy.

In a sense, Hadoop's MapReduce-centrism was essential to its early growth and development. Hadoop MapReduce exposes an open, reasonably straightforward parallel programming model. If it doesn't quite democratize parallel programming -- coding for MapReduce requires Java programming chops, expertise in other procedural languages, or knowledge of Pig Latin (coding for data management-specific MapReduce jobs, such as ETL processing, requires additional specialization) -- it certainly lowers the bar.

The thing is, MapReduce is a brute-force data processing tool: it might be ideal for certain kinds of workloads, but it's less ideal for others. As the Hadoop platform pushes deeper into the enterprise core -- e.g., from primary use in test-bed, development, or skunk-work scenarios to use in production environments -- its MapReduce-centrism becomes problematic: IT organizations and ISVs will increasingly want to run optimized workloads on their Hadoop clusters. Prior to YARN, it was possible to run non-MapReduce workloads in a Hadoop cluster, but it wasn't possible to use Hadoop's vanilla JobTracker and TaskTracker to manage them. Now that YARN is available, users should finally be able to manage, monitor, and scale mixed workloads in the Hadoop environment.
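To make the idea of framework-agnostic resource management concrete, here is a minimal sketch in Python. It is a toy model, not YARN's API: the names `ContainerRequest` and `ToyResourceManager` are hypothetical, and the only assumption borrowed from YARN is that capacity is handed out in containers sized by memory and virtual cores. The point it illustrates is that the manager never inspects which compute engine is asking.

```python
from dataclasses import dataclass


@dataclass
class ContainerRequest:
    """A request for cluster capacity (hypothetical type, for illustration only)."""
    app_id: str
    framework: str  # e.g. "mapreduce" or "interactive-sql"; never inspected by the manager
    memory_mb: int
    vcores: int


class ToyResourceManager:
    """Grants capacity to any framework while the cluster has room (toy model)."""

    def __init__(self, memory_mb: int, vcores: int) -> None:
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.granted: list[ContainerRequest] = []

    def request(self, req: ContainerRequest) -> bool:
        """Grant the request if it fits in the remaining capacity."""
        fits = (req.memory_mb <= self.free_memory_mb
                and req.vcores <= self.free_vcores)
        if fits:
            self.free_memory_mb -= req.memory_mb
            self.free_vcores -= req.vcores
            self.granted.append(req)
        return fits

    def release(self, req: ContainerRequest) -> None:
        """Return a previously granted container's capacity to the pool."""
        self.granted.remove(req)
        self.free_memory_mb += req.memory_mb
        self.free_vcores += req.vcores
```

In this sketch a MapReduce job and an interactive SQL query draw on the same pool through the same call, which is the mixed-workload scenario described above. A real scheduler would also handle queues, priorities, and node placement, all of which are omitted here.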

That's big, says Webster Mudge, senior director of technology solutions with Cloudera Inc. -- but it isn't as big as you might think. "We are really happy that YARN is now [generally available]. We've been running YARN for over a year now [internally], and it is the default resource management container for Cloudera Hadoop 5 and thus [for] Cloudera Enterprise. However, it's not the only answer to resource management," says Mudge, who argues that for some workloads -- for example, long-lived, kernel-level applications, and extremely short-lived applications -- YARN is insufficient or sub-optimal.

"YARN is a generalized container for resource management within Hadoop. If you think about how MapReduce the batch programming language was very tightly coupled with [the first versions of Hadoop], the earlier versions expected that you were going to be running MapReduce, so that's how [the Hadoop platform] did its resource management and the like. YARN is a big improvement, but you shouldn't believe the hype that it's what you need [alone] for multi-tenancy. It's one of the things you need, but YARN [by itself] doesn't give you security, governance, and management."

These disciplines -- along with failover/disaster recovery -- arguably account for Hadoop's biggest or most intractable growing pains. At this point, for example, "integrated" Hadoop management is -- by data management standards -- primitive. A presentation on precisely this topic at last year's Strata+Hadoop World conference focused on a single Hadoop distribution (Cloudera Enterprise) and a single GUI-based management tool (Cloudera Manager). It nonetheless made extensive use of a command-line interface (CLI) and CLI-based scripts.

What's more, until recently, Hadoop lacked important security features such as native role-based access control (RBAC) or support for volume-level encryption, which are checklist items for most large enterprise customers.

Last July, however, Cloudera kicked off "Sentry," an Apache-licensed OSS project for Hadoop.

At a minimum, Sentry aims to provide role-based authorization capabilities for Hadoop services such as Hive (a SQL-like interpreter for Hadoop that compiles queries into MapReduce jobs) and Impala (an interactive SQL query facility for Hadoop).

Cloudera has an even bigger vision, however. "We're starting to see [Hadoop] as a focal point for this granular control for all data sets, data types, data engines within Hadoop," says Mudge. "This means starting with the most common, the SQL-based [data types or data engines] -- so Impala and Hive will share the same privilege model because they share the same metadata model."
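The shared privilege model Mudge describes can be pictured with a short Python sketch. This is an illustration of role-based authorization in general, not Sentry's actual API or policy format: `Privilege`, `PolicyStore`, and `run_query` are hypothetical names, and the group-to-role-to-privilege layout is an assumption made for the example. What it shows is two engines consulting one policy store, so a grant made once applies to both.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """An action permitted on a table (hypothetical type, for illustration only)."""
    action: str  # e.g. "SELECT"
    table: str   # e.g. "sales.orders"


@dataclass
class PolicyStore:
    """One policy store consulted by every engine (assumed layout)."""
    role_privileges: dict = field(default_factory=dict)  # role -> set of Privilege
    group_roles: dict = field(default_factory=dict)      # group -> set of role names

    def grant(self, role: str, action: str, table: str) -> None:
        """Attach a privilege to a role."""
        self.role_privileges.setdefault(role, set()).add(Privilege(action, table))

    def assign(self, group: str, role: str) -> None:
        """Give every member of a group the named role."""
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups: list, action: str, table: str) -> bool:
        """True if any role held by any of the user's groups carries the privilege."""
        wanted = Privilege(action, table)
        return any(
            wanted in self.role_privileges.get(role, set())
            for group in groups
            for role in self.group_roles.get(group, set())
        )


def run_query(engine: str, store: PolicyStore, groups: list,
              action: str, table: str) -> str:
    """Every engine performs the same authorization check before doing any work."""
    verdict = "allowed" if store.is_allowed(groups, action, table) else "denied"
    return f"{engine}: {verdict} {action} on {table}"
```

Because both engines call the same `is_allowed` check against the same store, a user who may read a table through one engine may read it through the other, and a revoked role is revoked everywhere at once.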

Beyond this, Cloudera casts Sentry as a central security authority -- not just for Hadoop (which Cloudera likewise locates at the enterprise core -- i.e., as an enterprise data hub), but for apps or services of every kind, which will be able to hook into it and use it as a provider. Mudge cites Apache Solr -- an open source search and content management facility -- as one such example.

Sentry is still gestating, however. Currently, Cloudera's commercial distribution of Hadoop (Cloudera Enterprise) relies on products from partners to deliver advanced security features, such as data masking, tokenization, and volume-level encryption. "What you're seeing is that within the data protection layer of Hadoop, Hadoop itself doesn't necessarily provide those capabilities out of the box, but it relies on the insertion points in the substrate for partners," says Mudge.

One Hadoop-focused vendor that isn't a Cloudera partner is Zettaset Inc. It nonetheless partners with a veritable Who's Who of DM players, including Actian Corp., IBM Corp., Informatica Corp., MicroStrategy Inc., and Teradata Corp. Zettaset's specialty? Nothing less than "secure big data management," says president and CEO Jim Vogt.

"What we've built is enterprise software [Zettaset Orchestrator] that rides up on top of open source software and hardens it for the enterprise. Our focus is on ease of management, scale, performance, and security," explains Vogt, who argues that "the big thing holding up [production] deployments [of Hadoop] is security. [Adopters] have [security] mandates or requirements they just can't meet" using free or commercial distributions of Hadoop.

Vogt explicitly contrasts Zettaset's approach -- "We file patents, we don't just donate everything back to the community." -- with those of players such as Cloudera and Hortonworks Inc. "We support [volume-level] encryption," he explains, "and we have some patents that we've filed around FLASH-aware and SAN-aware [encryption] so that you can optimize based on that."

Zettaset is extending its RBAC facility to interoperate with the BI and DBMS offerings of its partners. "We have role-based access control that is very granular. We have an API for RBAC, and we opened up our security framework across [Hadoop] distributions and across applications or databases. We can integrate with MicroStrategy, with Teradata, with Hortonworks," says Vogt.

"One key partner is Informatica, and they allow us to do ETL and data transfer and also some basic visualization. What they like about [partnering with] us is, they were certifying [Hadoop] distribution by [Hadoop] distribution, but if they use Orchestrator, they can certify for just us."
