TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Benefits of the Hadoop-Based Data Lake

In some emerging best practices, a free-form data lake implemented on Hadoop complements a structured relational data warehouse.

By Philip Russom
November 3, 2016

The data lake is a new design pattern that specifies a few rules for organizing data, similar to how older design patterns did. The primary rule for a data lake is that it should be a repository for raw, detailed data that's captured, stored, and managed in its original schema or format with little or no transformation.

A data lake focuses on detailed source data so that the source can be repurposed many ways as new requirements in advanced analytics evolve and emerge. The data lake "future-proofs" analytics by provisioning ample source data for a wide range of analytics that cannot yet be foreseen because of the rapid pace of change within organizations and across marketplaces.
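As a rough illustration of the "store raw, repurpose later" rule, the sketch below lands a payload byte-for-byte and applies a schema only when an analysis reads it. It is a minimal sketch using a local directory as a stand-in for a data lake; the function names (`land_raw`, `read_json_lines`), the directory layout, and the clickstream sample are assumptions made for illustration, not part of any Hadoop API or of the article's own method.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a source payload byte-for-byte, with no parsing or transformation."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)
    return target


def read_json_lines(path: Path, fields: list[str]) -> list[dict]:
    """Apply a schema only at read time: keep just the fields an analysis needs."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            record = json.loads(line)
            rows.append({field: record.get(field) for field in fields})
    return rows


# Hypothetical clickstream payload, exactly as a source system might emit it.
payload = (
    b'{"user": "u1", "page": "/home", "ms": 120}\n'
    b'{"user": "u2", "page": "/cart", "ms": 340}\n'
)

# A temporary local directory stands in for the lake's storage layer.
lake_root = Path(tempfile.mkdtemp())
landed = land_raw(lake_root, "clickstream", "2016-11-03.jsonl", payload)

# Two different analyses repurpose the same untouched file.
page_views = read_json_lines(landed, ["user", "page"])
latencies = read_json_lines(landed, ["ms"])
```

Because the landed file is never rewritten, a third analysis with requirements nobody foresaw can read the same bytes with yet another projection.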

Given today's exploding data volumes, a data lake needs to scale to tens or hundreds of terabytes and sometimes petabytes. To provide massive scalability at a reasonable cost, Hadoop has arisen as the most common data platform for data lakes. Even so, data lakes and similar emerging data-driven design patterns (e.g., data vaults, enterprise data hubs) may also be deployed on relational database management systems (RDBMSs) or other file systems besides Hadoop.

How Data Lakes and Data Warehouses Can Work Together

The Hadoop-based data lake is important because it can extend the life and capabilities of a data warehouse.

One of the stronger trends in data warehousing is to diversify the portfolio of data platforms so that technical users can choose just the right platform for storing, processing, or delivering data sets and the products based on them. In the modern multiplatform data warehouse environment (DWE), almost all core warehouses still run on RDBMSs, but they may be integrated with other platforms -- typically Hadoop and specialized RDBMSs (based on appliances, columns, clouds, or specific forms of analytics).

In this hybrid environment, the core warehouse continues to be the preferred platform for reporting (from standard reports to dashboards), dimensional data (for OLAP, cubes, star schema, etc.), and data that requires extensive improvement or accuracy (e.g., financial reports).

However, the raw data for advanced forms of analytics is progressively being stored and processed on the other platforms of the DWE. This offloads the core warehouse so it can scale and focus on data that requires mature relational functionality (as reports and dimensional data do). This also takes raw, detailed data to platforms that are well suited to advanced forms of analytics (based on mining, clustering, statistics, graph, etc.) at scale and with a reasonable cost.

The Hadoop-based data lake is emerging as a natural fit for the large volumes of data for advanced analytics that are being relocated as organizations modernize their DWEs. Even so, TDWI also sees columnar databases and other specialty RDBMSs playing roles within the DWE.

The trend is toward having the Hadoop-based data lake be the ingestion platform and analytics archive for the DWE, while sandboxing and set-based analytics are done on specialty RDBMSs (but perhaps on Hadoop, too) and reporting and related functions are provisioned by the core warehouse.
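The division of labor described above can be written down as a simple placement table. This is only a sketch of the trend as the article states it: the workload labels, the `Platform` enum, and the `candidate_platforms` helper are hypothetical names chosen for illustration, and a real DWE would weigh many more factors.

```python
from enum import Enum


class Platform(Enum):
    DATA_LAKE = "Hadoop-based data lake"
    SPECIALTY_RDBMS = "specialty RDBMS"
    CORE_WAREHOUSE = "core warehouse (RDBMS)"


# Workload labels are illustrative; the mapping follows the article's description.
# Platforms are listed in order of preference for each workload.
PLACEMENT = {
    "ingestion": [Platform.DATA_LAKE],
    "analytics_archive": [Platform.DATA_LAKE],
    "sandboxing": [Platform.SPECIALTY_RDBMS, Platform.DATA_LAKE],
    "set_based_analytics": [Platform.SPECIALTY_RDBMS, Platform.DATA_LAKE],
    "reporting": [Platform.CORE_WAREHOUSE],
    "dimensional_data": [Platform.CORE_WAREHOUSE],
}


def candidate_platforms(workload: str) -> list[Platform]:
    """Return the platforms the article associates with a workload."""
    if workload not in PLACEMENT:
        raise ValueError(f"no placement rule for workload: {workload!r}")
    return PLACEMENT[workload]


for workload in ("ingestion", "sandboxing", "reporting"):
    names = ", ".join(p.value for p in candidate_platforms(workload))
    print(f"{workload}: {names}")
```

Keeping the rules in one table makes the "perhaps on Hadoop, too" hedge explicit: sandboxing and set-based analytics list two candidates, while reporting stays on the core warehouse.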

Architectures Still Evolving

Note that it is still early days for the multiplatform architecture of the DWE, as well as for Hadoop and the data lake. It is difficult to say into what architectural patterns the DWE will eventually evolve.

However, one direction is sure: a wide range of organizations will continue to diversify their data platforms as they let go of older paradigms that sought to make a single data warehouse instance handle all or most data workloads. Instead, most DW programs are moving toward multiple best-of-breed and purpose-built platforms that are tightly integrated. (The survey for the 2016 TDWI Best Practices Report: Data Warehouse Modernization corroborates this claim.)

The Hadoop-based data lake fits these and other trends quite well. The real driver is that enterprises need a broader range of analytics types so they can get better at making fact-based decisions, optimizing their organizational performance, and competing on analytics.

The Hadoop-based data lake is gaining in popularity because it can capture the volume of big data and other new sources that enterprises want to leverage via analytics, and it does so at a low cost and with good interoperability with other platforms in the DWE. In this sense, Hadoop and data lakes add value to the DW and its environment without ripping and replacing mature investments.

In other words, in the emerging best practices of the DWE, a free-form data lake complements a structured data warehouse. That's why TDWI expects to see both working together in more and more DWEs.

Further Reading: For a deeper definition of the data lake, read my article from June 2016: "The Data Lake -- What it is, What it's for, Where it's going."

For more about the trend toward multiplatform data warehouse environments (DWEs), read TDWI Best Practices Report: Data Warehouse Modernization.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.
