TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Data Lake Management Innovations

When designed and managed properly, a data lake can enable faster, more trusted big data analytics.

By Philip Russom
January 23, 2017

I recently spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica's Murthy Mathiprakasam and Cognizant's Tavo De Leon. We three had an interactive conversation about the technology and business requirements of data lakes as faced today by data management professionals and the organizations they serve.

There's a lot to say about data lakes, but I focused on the roles played by metadata and data governance because these are two of the most pressing requirements. Below I've summarized my portion of the webinar.

The data lake is about earlier data ingestion and later data preparation on the fly.

A data lake ingests data in its raw, original state, straight from data sources, with little or no cleansing, standardization, remodeling, or transformation. Data management best practices can then be applied flexibly later as diverse use cases demand.

Data in a lake can be improved on the fly during exploration (for ad hoc discovery and analytics), at intermediate stages to prep data for recurring tasks (such as reporting and performance management), or much later (as new analytics applications are envisioned).
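The ingest-raw, prepare-later pattern can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the lake's landing zone; `RawZone`, `prepare_for_reporting`, and the field names `cust` and `amt` are hypothetical and are not drawn from the webinar or from any particular product.

```python
import json
from datetime import datetime, timezone


class RawZone:
    """Hypothetical in-memory stand-in for a data lake's raw landing zone."""

    def __init__(self):
        self._records = []

    def ingest(self, source, payload):
        # Store the payload exactly as received: no cleansing or remodeling.
        self._records.append({
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "raw": payload,
        })

    def read(self, prepare=None):
        # Preparation is applied only at read time, per use case.
        for record in self._records:
            yield prepare(record["raw"]) if prepare else record["raw"]


def prepare_for_reporting(raw):
    """One hypothetical preparation step: parse JSON and standardize fields."""
    row = json.loads(raw)
    return {
        "customer": row["cust"].strip().title(),
        "amount": round(float(row["amt"]), 2),
    }


lake = RawZone()
lake.ingest("orders_app", '{"cust": "  alice SMITH ", "amt": "19.989"}')

explored = list(lake.read())                       # raw, for ad hoc exploration
reported = list(lake.read(prepare_for_reporting))  # prepared for a recurring report
```

The same raw record serves both readers. Only the reporting path pays the cost of standardization, and a new analytics application could add its own preparation function later without re-ingesting anything.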

Many other scenarios are possible, too, with evolving data lake practices. However, the trend is toward doing less "pre-preparation" of data so that data "discovery zones" are updated with data in an agile manner. Note that some data preparation is still in use but nowhere near the extreme level seen for warehouses and reports. The early ingestion of data means that operational data is captured and made available as soon as possible, and yet the data is still prepped so that it's fit for the intended purposes of exploration, discovery, and analytics.

Most data lakes are built atop Hadoop, which allows them to capture big data and enable advanced analytics processing. Hadoop enables a data lake to capture, process, and repurpose a wide range of data types and structures with linear scalability and high availability.

Although it may sound very new, a Hadoop-based data lake still needs established best practices and tools for data management. That way the lake can participate in both old and new data supply chains in a fast, flexible, systematic, and repeatable fashion.

Likewise, Hadoop-based data lakes are proving that they can integrate with a wide range of enterprise data ecosystems and be managed according to policy-based data governance. In these contexts, good data management and governance can raise the quality and usefulness of the data and keep the lake from deteriorating into a so-called data swamp.

Data lakes are already deployed in real-world use cases.

The data lakes described below may physically share one enterprisewide Hadoop cluster, but logically they are separate data lakes.

Analytics data lakes. These can be as simple as standalone data lakes built for one application, such as sentiment analysis or money laundering detection. In other cases, a Hadoop-based data lake can extend and reduce the burden on a data warehouse by supporting data staging, archiving, and processing for analytics. In short, a Hadoop-based data lake can augment and modernize a data warehouse to embrace big data and advanced analytics without replacing the warehouse.

Marketing data lakes. These are hot right now as marketers discover that a data lake is excellent for making correlations and predictions across multiple customer channels, which in turn leads to higher conversion rates in cross-selling. The same lake can also enable new levels of accuracy and insight for customer segmentation and complete views of customers.

At TDWI, we're also seeing other data lakes with a business function or industry focus -- for example, sales performance data lakes, healthcare data lakes, and financial fraud data lakes.

Diverse metadata makes a data lake more accessible, valuable, and trusted for a wider range of user types.

Today, a growing number of nontechnical or somewhat technical users want to work hands-on with data. Instead of raw technical metadata, these users need business metadata, which employs human-language descriptions of data. In fact, without business metadata, the range of users who can access the data of a lake is seriously limited.

Note that technical users create business metadata in addition to technical metadata. For the technical user to create business metadata that's truly useful and accurate, mappings between metadata types should be based on a governed business glossary of terms, which specifies the data owned by the business in industry- and corporate-standard language.

For many users, the point of implementing a Hadoop-based data lake is to enable self-service practices, including data access, exploration, discovery-oriented analytics, data prep, visualization, and advanced forms of analytics. Note that all these emerging self-service practices rely heavily on the cataloguing of business metadata. Without business metadata, nontechnical users cannot search for data using business terms; work quickly, independently, and collaboratively; or get full value from a data lake.
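To make the glossary-to-metadata mapping concrete, here is a minimal sketch of a catalog that ties technical column names to governed business terms so that a nontechnical user can search by business language. The glossary entries, dataset names, and column names are all invented for illustration; a real deployment would rely on a metadata management or catalog tool rather than in-memory structures.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    dataset: str        # where the data lives in the lake
    column: str         # technical metadata: physical column name
    business_term: str  # key into the governed business glossary


# Hypothetical governed glossary: business term -> human-language definition.
GLOSSARY = {
    "customer lifetime value": "Projected net revenue from a customer over the whole relationship.",
    "churn flag": "Indicates that a customer has cancelled service.",
}

# Hypothetical mappings from technical metadata to glossary terms.
CATALOG = [
    CatalogEntry("crm_raw", "cust_ltv_amt", "customer lifetime value"),
    CatalogEntry("billing_raw", "chrn_ind", "churn flag"),
    CatalogEntry("web_logs", "ltv_est", "customer lifetime value"),
]


def search(term_fragment):
    """Find datasets and columns by business language, not column names."""
    fragment = term_fragment.lower()
    return [
        (entry.dataset, entry.column, GLOSSARY[entry.business_term])
        for entry in CATALOG
        if fragment in entry.business_term
    ]


def unmapped_entries():
    """Catalog entries whose business term is missing from the glossary."""
    return [entry for entry in CATALOG if entry.business_term not in GLOSSARY]


hits = search("Lifetime")
```

Searching for "Lifetime" finds both `cust_ltv_amt` and `ltv_est`, which a user could not have guessed from the column names alone. The `unmapped_entries` check is one simple way to keep mappings tied to the governed glossary.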

A data lake must be governed or else it may become a data swamp.

When a data lake is not managed properly, it deteriorates into a data swamp -- an undocumented and disorganized data store that is nearly impossible to navigate, trust, or leverage for organizational advantage. However, this risk is easily managed and mitigated by data governance and other process-driven data management best practices.

As with any important data asset, lake data should be curated by a steward who is responsible for driving trust and understanding of the data in the data store. TDWI feels that the best stewards are businesspeople (rather than technical staff) because they can prioritize based on business needs to keep data management aligned with business goals. Improvements to data lakes should give priority to metadata, "just enough" structure, and diversifying data and tools.
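One way a steward might watch for swamp risk is a simple completeness check over the lake's catalog. The sketch below assumes each dataset record should carry a steward, a business description, and a source system. That list of required fields is an assumption for illustration, not a TDWI standard, and the dataset names are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    name: str
    steward: Optional[str] = None
    business_description: Optional[str] = None
    source_system: Optional[str] = None


# Assumed minimum documentation for a governed dataset (illustrative only).
REQUIRED_FIELDS = ("steward", "business_description", "source_system")


def governance_gaps(dataset):
    """Return the names of required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(dataset, name)]


def swamp_report(datasets):
    """Map each under-documented dataset to its list of gaps."""
    report = {}
    for dataset in datasets:
        gaps = governance_gaps(dataset)
        if gaps:
            report[dataset.name] = gaps
    return report


lake_catalog = [
    DatasetRecord("sales_perf", "J. Rivera", "Quarterly sales by region", "erp"),
    DatasetRecord("clickstream_dump", source_system="web"),
    DatasetRecord("tmp_export_17"),
]

report = swamp_report(lake_catalog)
```

A report like this gives a steward a prioritized worklist: fully documented datasets drop out, and the remainder show exactly which pieces of "just enough" structure are missing.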

Conclusions

A data lake is a bit of a balancing act. On the one hand, the data lake's primary benefit is that it liberates analytics users by enabling new practices in agile data ingestion, with a focus on consolidating large volumes of diverse data in the lake. That, in turn, helps many users discover new opportunities and work with advanced analytics.

On the other hand, the data lake still needs some of the established best practices of data management and governance. Otherwise, the data managed in the lake can become redundant (and skew analytics results), lack a trusted audit trail, suffer integrity and quality problems, and be difficult to find and query. When those maladies beset a data lake, it becomes the dreaded data swamp.
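Redundant data and a missing audit trail can be addressed together. The following sketch fingerprints each record with a content hash, keeps the first copy, and logs every duplicate it drops. The record layout and function names are hypothetical, and the canonical form used for hashing (sorted `key=value` pairs) is one simple choice among many.

```python
import hashlib


def fingerprint(record):
    """Hash a record's content independent of key order."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first copy of each record and log every duplicate dropped."""
    first_seen = {}  # fingerprint -> position of the first occurrence
    unique, audit_trail = [], []
    for position, record in enumerate(records):
        digest = fingerprint(record)
        if digest in first_seen:
            audit_trail.append(
                {"position": position, "duplicate_of": first_seen[digest]}
            )
        else:
            first_seen[digest] = position
            unique.append(record)
    return unique, audit_trail


records = [
    {"id": 1, "amt": 10},
    {"amt": 10, "id": 1},  # same content, different key order
    {"id": 2, "amt": 5},
]

unique, audit_trail = deduplicate(records)
```

Because the duplicate is recorded rather than silently discarded, analysts can later verify why a count changed, which supports the trusted audit trail described above.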

Get the full benefit from your data lake by ingesting all kinds of data and by preparing and improving data to an appropriate degree so the data is accessible, trusted, and insightful.

If you'd like to hear more of my discussion with Informatica's Murthy Mathiprakasam and Cognizant's Tavo De Leon, please visit Database Trends and Applications to replay the Informatica webinar.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.
