TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

RESEARCH & RESOURCES

Hadoop, Cheap Storage, and Parallel Processing

Casting Hadoop as a platform for cheap storage -- e.g., a Hadoop-based "data lake" -- only gets at half of what makes Hadoop new, compelling, and uniquely valuable.

By Stephen Swoyer
May 26, 2015

Never mind the hype, says Jamie Keeffe, a product marketing manager with master data management (MDM) specialist RedPoint Global Inc. Casting Hadoop as a platform for cheap storage -- e.g., the increasingly ubiquitous Hadoop-based "data lake" -- only gets at half of what makes Hadoop new, compelling, and (from a customer perspective) uniquely valuable.

Remember, Keeffe points out, Hadoop combines a scalable distributed storage layer -- HDFS, or the Hadoop Distributed File System -- with a baked-in, general-purpose parallel processing layer.

This combination is new if not exactly unprecedented. The massively parallel processing, or MPP, database boasts something similar -- with one (critical) caveat: an MPP database isn't in any sense a "general-purpose" parallel processing environment. It's optimized specifically for query processing.

These days, Hadoop can do it all, says Keeffe, who claims he's frustrated with the many ways in which Hadoop has been co-opted by self-serving vendors.

"The way most [master data management or data quality vendors] treat Hadoop, it's effectively relegated to the role of just cheap storage. They're designed to move large volumes of data out of Hadoop across the wire and into a traditional MDM [hub] or data quality [engine] for processing. Adding insult to injury, the data [that's processed] in the traditional MDM [hub] is then moved back into Hadoop -- and it's no richer for the journey."

Keeffe arguably has a self-serving interest -- his company, RedPoint, markets its own Data Management Platform for Hadoop, after all -- but he does make a good point. Because Hadoop consolidates storage and compute, it's possible to bring processing to the data -- instead of consolidating data at a central site and processing it there.

In this scheme, which is no less attractive because of its low cost (Hadoop is a comparatively inexpensive platform for both distributed storage and parallel processing), Hadoop becomes the de facto storage repository for relevant business information -- or, in some implementations, for all information that's generated or collected by an organization.

This raises a question that's by no means specific to Keeffe and RedPoint: if an organization has already invested in Hadoop, why shouldn't it take advantage of the Hadoop platform's cheap, scalable storage? Why shouldn't an enterprise use Hadoop as a context in which to stage and prepare data, as well as to cleanse and standardize it?

This is precisely Keeffe's point.

"For data management, this [self-serving] approach creates some serious opportunity costs. When you're doing such things as identifying, matching, or linking with highly configurable rules, [those kinds of workloads] can significantly benefit from the ability to split off a job and process it across Hadoop nodes or even [across] the [SMP] cores within those individual nodes," he points out.

"However, if you move [the data] out of Hadoop for processing in an external MDM [hub or data quality engine], you lose that ability, and [parallel processing for MDM and data quality] is an obvious business application for Hadoop. If you're doing MDM, you can process data as you load it, or shortly [there]after. The kinds of [operations] you do in MDM -- [for example,] identifying and matching -- can benefit from Hadoop's parallelism. [Hadoop] jobs can be multi-threaded so that they split themselves up to run across the different nodes to take advantage of available compute resources. If [a job in Hadoop is] YARN-certified, you can have really fine-grained control. You can have [jobs] that would take potentially hours to run in an external MDM [hub] completing in just minutes."
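To make the idea of splitting a matching job across nodes or cores concrete, here is a minimal sketch in Python. It is not RedPoint's method and uses no Hadoop APIs: the sample records, the postal-code blocking rule, the 0.85 similarity threshold, and the function names are all assumptions made for illustration, and a thread pool stands in for the cluster's worker nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records: (record_id, name, postal_code).
RECORDS = [
    (1, "Jon Smith", "92101"),
    (2, "John Smith", "92101"),
    (3, "Ann Lee", "32801"),
    (4, "Anne Lee", "32801"),
    (5, "Bob Ray", "10001"),
]

def block_by(records, key_index):
    """Partition records so that only records sharing a key are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key_index]].append(record)
    return list(blocks.values())

def match_block(block, threshold=0.85):
    """Compare every pair inside one block and keep the likely duplicates."""
    pairs = []
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
        if score >= threshold:
            pairs.append((a[0], b[0]))
    return pairs

def match_in_parallel(records, workers=4):
    """Match each block independently; the pool stands in for cluster workers."""
    blocks = block_by(records, key_index=2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(match_block, blocks))
    return sorted(pair for pairs in results for pair in pairs)

matches = match_in_parallel(RECORDS)
print(matches)
```

Because each block is matched without reference to any other block, the work can be handed to as many workers as there are blocks. That independence is the property that lets a job of this kind spread across nodes, or across the cores within a node.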

By YARN, Keeffe means the new resource manager (YARN is actually a backronym for "Yet Another Resource Negotiator") that debuted 18 months ago with version 2.0 of Hadoop. YARN is the culmination of a massive overhaul of Hadoop's baked-in parallel computing architecture. Unlike Hadoop v1.x -- which was tightly coupled to a batch-only implementation of the MapReduce engine -- Hadoop 2.x and YARN now support interactive and query workloads (via Apache Tez) and real-time data processing (via long-running services deployed with Apache Slider), along with, of course, brute-force batch workloads (via legacy MapReduce).

More important, YARN makes it possible for third-party engines, such as Apache Spark, to run as full-fledged citizens -- complete with granular resource management -- in a Hadoop cluster. (It's always been possible to run third-party engines in Apache Hadoop, but -- prior to YARN, and absent the use of distribution-specific or proprietary management tooling -- it wasn't possible to manage or, more precisely, to allocate compute resources for non-MapReduce jobs.)

In other words, versions of Hadoop prior to 2.0 were tightly coupled to MapReduce, such that it wasn't possible to schedule and parallelize -- with anything approaching granularity -- non-MapReduce workloads in the Hadoop environment. YARN decoupled Hadoop from this dependence.

Keeffe could be said to have a self-serving interest in playing up this aspect of RedPoint's Hadoop integration, however. After all, RedPoint claims that its Data Management Platform for Hadoop is a "native" YARN application. If you're shrugging your shoulders thinking "Big deal," Keeffe begs to differ. There's a world of difference, he argues, between "YARN-ready" software and applications -- such as RedPoint Data Management for Hadoop -- that are YARN native.

For example, an application that uses Hive -- a SQL-like interpreter for Hadoop that compiles Hive Query Language (HiveQL) queries into MapReduce jobs -- to query Hadoop data, or to get data into and out of Hadoop, qualifies as "YARN-ready." However, a YARN-ready application doesn't have fine-grained control over scheduling, resource use, parallelization, and other aspects of Hadoop performance.
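As a rough illustration of what it means for a SQL-like query to become a MapReduce job, the sketch below expresses a simple aggregation, roughly `SELECT postal_code, COUNT(*) FROM customers GROUP BY postal_code`, as map, shuffle, and reduce phases in plain Python. This is a conceptual model only: Hive's actual compiler and execution plans are far more involved, and the table, column, and function names here are assumptions.

```python
from collections import defaultdict

# Hypothetical "customers" table, one dict per row.
ROWS = [
    {"name": "Jon Smith", "postal_code": "92101"},
    {"name": "John Smith", "postal_code": "92101"},
    {"name": "Ann Lee", "postal_code": "32801"},
    {"name": "Anne Lee", "postal_code": "32801"},
    {"name": "Bob Ray", "postal_code": "10001"},
]

def map_phase(rows):
    """Emit one (group key, 1) pair per input row."""
    for row in rows:
        yield row["postal_code"], 1

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def run_group_by_count(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))

counts = run_group_by_count(ROWS)
print(counts)
```

The application that issues the query never sees these phases. It hands the query to Hive and takes what it gets, which is why such an application has no say over how the resulting job is scheduled or parallelized.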

"The number one thing is that [running as a native YARN application] eliminates 100 percent of the programming code that runs on Hadoop. If your application can speak native YARN, you can write that ApplicationMaster and YARN will instantiate all of the core engines [required by] that ApplicationMaster to run in Hadoop the same way that MapReduce does."

Keeffe's citation of an ApplicationMaster is borderline technical, but -- to vastly oversimplify what's involved -- think of an ApplicationMaster as kind of like YARN's DNA. It's an encoded, highly detailed template for executing workloads in parallel. (DNA is itself an encoded template for protein synthesis, which is the engine of cell regeneration and growth.) More concretely, an ApplicationMaster is the per-application process that negotiates containers from YARN's ResourceManager and coordinates the tasks that run in them.
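For readers who prefer code to metaphor, the toy simulation below sketches the negotiation an ApplicationMaster carries out: it asks a resource manager for containers, runs tasks in whatever it is granted, and hands the containers back. None of this is the real YARN API, which is Java-based. The class names, the first-fit allocation rule, and the memory figures are all hypothetical and chosen only to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mb: int

@dataclass
class Container:
    node: str
    memory_mb: int

class ToyResourceManager:
    """Tracks free memory per node and grants containers on request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        # First-fit: grant from the first node with enough free memory.
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return Container(node.name, memory_mb)
        return None  # nothing available right now

    def release(self, container):
        for node in self.nodes:
            if node.name == container.node:
                node.free_mb += container.memory_mb

class ToyApplicationMaster:
    """Requests containers for its tasks and runs them in waves."""

    def __init__(self, rm, tasks, memory_mb):
        self.rm = rm
        self.tasks = tasks
        self.memory_mb = memory_mb

    def run(self):
        pending = list(self.tasks)
        results = []
        while pending:
            granted = []
            for task in list(pending):
                container = self.rm.allocate(self.memory_mb)
                if container is None:
                    break  # cluster is full; run what was granted first
                granted.append((task, container))
                pending.remove(task)
            if not granted:
                raise RuntimeError("no node can fit a container of this size")
            for task, container in granted:
                results.append((container.node, task()))
                self.rm.release(container)
        return results

nodes = [Node("node-a", 2048), Node("node-b", 1024)]
tasks = [lambda i=i: i * i for i in range(4)]
results = ToyApplicationMaster(ToyResourceManager(nodes), tasks, 1024).run()
print(results)
```

With four tasks and room for only three 1,024 MB containers, the toy master runs three tasks in the first wave and the fourth once memory is released. Deciding how much to ask for, and what to do when the cluster is full, is the kind of fine-grained control Keeffe attributes to YARN-native applications.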

Keeffe's point is that writing to (or using a YARN-native app such as RedPoint to automatically generate) an ApplicationMaster eliminates the need for highly specialized, domain-specific coding skills (e.g., coding data engineering-specific transformations -- such as directed acyclic graphs -- in Java or Pig) and likewise simplifies the process of scheduling workloads to run in optimized engines such as Tez or Slider.

"Without an ApplicationMaster, the point is that it's not a native YARN application, so there will be coding that needs to be done -- either coding that's generated by your apps and processes in Hadoop or coding that you have to write yourself in Python, Scala, or other [languages]."
