TDWI Blog: Data 360

Q&A RE: Hadoop for the Enterprise

Attendees of a recent TDWI Webinar asked excellent questions.

By Philip Russom, TDWI Research Director for Data Management

On April 14, I broadcast a TDWI Webinar in which I presented some of the findings from my new TDWI report, "Hadoop for the Enterprise." You can download a free copy of the report as a PDF, and you can replay the Webinar. With each link, you may need to scroll down to find what you want. If you’re new to Hadoop, you may wish to first read the 2013 TDWI Best Practices Report "Integrating Hadoop into Business Intelligence and Data Warehousing."

Attendees of the Webinar posed several very good questions about various issues around Hadoop. Please allow me to share a few attendee questions and the answers I sent them via e-mail:

What is a Hadoop cluster? And why would an organization need more than one?

The Wikipedia article on “Computer Cluster” is a good general description of all clustered server pools. The article doesn’t mention Hadoop, but Hadoop’s clustering strategy is in line with the article, except that Hadoop can run on heterogeneous servers, whereas the article recommends that all servers be identical. The point of any cluster is to get scalable, high-performance computational power at a relatively low cost by using commodity-priced hardware.

An organization may need more than one Hadoop cluster because of departmental funding and sponsorship (which is common with analytic applications) or other organizational dynamics. As I pointed out in the Webinar, as users decide on a strategy for Hadoop on an enterprise scale, they tend to abandon the departmental focus in favor of central IT providing Hadoop as a shared enterprise asset (as IT often does with corporate networks, racks of servers, and storage subsystems).

You don't need big data to take advantage of Hadoop?

That’s correct. I’ve found many user organizations with a small Hadoop implementation (8 nodes seems common) used as the data layer under a departmental analytic application or analytics sandbox of some sort. Hadoop makes sense when the department has exotic data (perhaps in lots of files), which is the kind of data Hadoop handles well. Use cases include sentiment analytics with schema-free human language text or supplier analytics with multi-structured XML or JSON files.

Note that, in the examples, the data volumes are modest, but it’s still “big data” in the sense that it’s not the usual structured and relational data. For many users dealing with big data (whether on Hadoop or elsewhere), the value proposition is that big data is new and different, and therefore offers new insights and more complete views of customers. Even when big data is truly big (tens of terabytes or more), users don’t have much trouble managing it; hence, big data is not a scalability crisis, as some people have claimed.

Hadoop has a well-deserved reputation for scaling up linearly. But these examples show that Hadoop also scales down successfully.
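To make the "multi-structured" supplier use case above a little more concrete, here is a minimal Python sketch of reading JSON documents that do not share a fixed schema and counting which fields actually occur. The sample records, field names, and the `flatten` and `field_coverage` helpers are all invented for illustration; they are not taken from the report, and a real Hadoop deployment would run this kind of logic through its own processing tools rather than a single script.

```python
import json
from collections import Counter

# Hypothetical supplier records (invented sample data): one JSON document
# per line, with no schema shared across the documents.
RAW_LINES = [
    '{"supplier": "Acme", "rating": 4, "contacts": [{"email": "a@acme.example"}]}',
    '{"supplier": "Globex", "address": {"city": "Springfield", "country": "US"}}',
    '{"supplier": "Initech", "rating": 5, "address": {"city": "Austin"}}',
]


def flatten(doc, prefix=""):
    """Flatten nested dicts and lists into a dict keyed by dotted paths."""
    flat = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot left by the parent level.
        flat[prefix.rstrip(".")] = doc
    return flat


def field_coverage(lines):
    """Count how many records contain each flattened field."""
    counts = Counter()
    for line in lines:
        counts.update(flatten(json.loads(line)).keys())
    return counts


coverage = field_coverage(RAW_LINES)
for field, count in sorted(coverage.items()):
    print(f"{field}: present in {count} of {len(RAW_LINES)} records")
```

The structure is discovered while reading the data rather than declared up front, which is why modest volumes of this kind of data can still be awkward for a purely relational platform.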

Do companies transfer master data into Hadoop to support analytics in a real-time or batch data replication process?

Yes, but that’s still rather rare today. In fact, only 10% of survey respondents who have Hadoop in production today are doing master data management (MDM) on Hadoop. But 45% anticipate doing so within three years. Data quality is in a similar position, with 11% doing it today versus 55% in the future. Personally, I’ve seen it take a while to ramp up all the data management best practices when a new data platform appears. That seems to be the case with Hadoop. But the proliferation of Hadoop into more of the enterprise is driving up requirements for data management best practices, too.
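As a quick piece of arithmetic on the survey figures quoted above, the short Python sketch below computes the increase in percentage points and the multiple over today's share. It assumes the "today" and "future" percentages are measured against the same respondent base, which the text implies but does not state; the `growth` helper is just an illustrative name.

```python
# Survey percentages quoted in the text: (doing it today, anticipated in future).
SURVEY = {
    "master data management": (10, 45),
    "data quality": (11, 55),
}


def growth(today_pct, future_pct):
    """Return (percentage-point increase, multiple of today's share)."""
    return future_pct - today_pct, future_pct / today_pct


for practice, (today, future) in SURVEY.items():
    points, multiple = growth(today, future)
    print(f"{practice}: +{points} points, {multiple:.1f}x today's share")
```

Under that assumption, both practices would grow to roughly four to five times their current adoption on Hadoop.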

Let’s now focus on your question. Modern MDM architectures typically support a mix of operational and analytic purposes; they do the same on Hadoop.

Today, Hadoop is strong on volume but weak on real-time operation, so MDM (like other operations) on Hadoop is usually batch oriented. Given strong Hadoop projects like Storm and Spark, real-time data operations should become more practical soon.

Can we get a use case for Hadoop and MDM?

As I mentioned in the Webinar, MDM on Hadoop is pretty rare today, but survey results show it will soon be far more common, along with similar practices like data quality.

There are many ways to architect an MDM solution, but many are built atop or around some kind of hub, which includes a database or operational data store (ODS) plus appropriate interfaces in and out of the hub. At TDWI, we’ve seen a number of organizations start migrating subsets of enterprise data to Hadoop, and simply modeled databases and ODSs seem to migrate to Hadoop successfully. The straightforward tabular structures of these (unlike complex warehouse dimensions) usually fit well with Hive tables or HBase in the Hadoop environment. With the so-called enterprise data hub on Hadoop gaining in popularity, we should expect to see more migrations like this in coming years.

A lot of MDM master databases (or systems of record) have very wide records, because they’re also used to compile the “complete view” of customers and other enterprise entities. I’ve heard conflicting opinions from Hadoop users; some think Hive tables are best for wide records, while others swear HBase is best. I hear similar debates involving query mechanisms, including HiveQL, Pig, Drill, and Impala. If you contemplate similar tasks, I recommend you take a known ODS to Hadoop and test on both Hive and HBase, with a variety of query approaches.
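One way to structure the side-by-side test recommended above is a small timing harness that runs each store and query approach several times and compares median latencies. The Python sketch below is only a skeleton under stated assumptions: the two `wide_record_lookup_*` functions are hypothetical placeholders that simulate work, and you would replace their bodies with real calls through whatever Hive or HBase client you use. Nothing here reflects an actual Hive or HBase API, and the placeholder timings say nothing about which store is faster.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Return the median wall-clock seconds over several runs of run_query()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(candidates, repeats=5):
    """Time every candidate and return (name, seconds) pairs, fastest first."""
    results = {name: time_query(fn, repeats) for name, fn in candidates.items()}
    return sorted(results.items(), key=lambda item: item[1])


# Hypothetical stand-ins: swap each body for a real query against your own
# migrated ODS. Both placeholders do identical dummy work on purpose.
def wide_record_lookup_on_hive():
    sum(range(10_000))


def wide_record_lookup_on_hbase():
    sum(range(10_000))


ranking = compare({
    "Hive table / HiveQL": wide_record_lookup_on_hive,
    "HBase table / row-key lookup": wide_record_lookup_on_hbase,
})
for name, seconds in ranking:
    print(f"{name}: {seconds:.6f} s (median of 5 runs)")
```

Using the median rather than the mean keeps a single slow run (for example, a cold cache) from dominating the comparison.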

Can HBase replace a classic data warehouse, and can it compete from a performance side?

If you have a “classic” data warehouse, then I’ll assume it is designed for dimensional models, optimized for complex queries, and supported by a rich metadata layer with auditing capabilities. HBase today is not particularly good with any of those, so it makes an unlikely replacement.

Even so, some pieces of the warehouse environment do well on HBase. For example, many warehouses include a number of operational data stores (ODSs). These may be physically managed in the warehouse’s core database instance, or they may be running on standalone hardware servers and database instances. Either way, I’ve interviewed users who’ve migrated these pieces to HBase—or Hive or both. They say it’s an easy migration, tweaking on the new platform is minimal, and performance is fine, as long as batch processing is all you need. Furthermore, moving these pieces to Hadoop frees up capacity on the warehouse, so it can grow into more data and use cases that truly must reside in the core warehouse platform. Or, if the migrated ODSs were on standalone platforms, then Hadoop seems to work as a consolidation strategy.

There has been less talk [about] making Hadoop transaction oriented, i.e., ACID compliant. Is there any trend or survey outcome?

To be honest, I haven’t looked into transaction processing on Hadoop, although I’ve heard that some people in both open source and vendor communities are working on it.

Why would I be so remiss? Because the leading use cases I see today don’t require transaction processing and hence the four ACID properties. That includes extensions of data warehousing and data integration, plus a wide range of analytics. Upcoming use cases—data archiving and content management—don’t involve transaction processing either. Furthermore, if you want open source software, the other NoSQL database management systems are strong on transaction processing (as are older open source databases), so you may wish to look into those.

I’m sorry to cop out on you with a non-answer. But at least you can see that transaction processing on Hadoop is a low priority for those of us excited about doing data warehousing, data integration, reporting, and analytics on Hadoop.
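For readers less familiar with what the ACID properties buy you, the Python sketch below illustrates just one of them, atomicity, using SQLite from the standard library as a convenient example of an established open source database. The `accounts` table, the names, and the `transfer` helper are hypothetical; this shows the general idea of a transaction that either fully applies or fully rolls back, and it says nothing about how any Hadoop component behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on success."""
    try:
        # The connection context manager commits if the block succeeds and
        # rolls back every statement in it if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


first_ok = transfer(conn, "alice", "bob", 30)    # succeeds
second_ok = transfer(conn, "alice", "bob", 500)  # overdraft, rolled back
print(first_ok, second_ok, balances(conn))
```

The second transfer credits one account before the debit fails, yet neither change survives. Batch-oriented analytic workloads like those described above rarely depend on that guarantee.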

Posted by Philip Russom, Ph.D. on April 15, 2015