TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

The Hadoop Ecosystem: Celebrating 10 Years, Tackling New Challenges

Even after 10 years, Hadoop technologies are still expanding, and I believe that we will soon see progress in two key areas that will improve stability and flexibility.

By David Stodder
May 2, 2016

'Tis the season of trade shows and conferences, from TDWI's own educational events to user conferences to the recent Strata + Hadoop World event in San Jose. My bags seem perpetually packed, the laundry always needs doing, and I even caught myself wearing a conference badge around my neck at home the other night. My wife said I needed it.

This year is the 10th anniversary of Hadoop, which has grown up into a broad ecosystem of technologies based on foundational Apache open source projects. Hadoop and MapReduce are certainly among the most important and influential technology developments of our generation.

Doug Cutting, now chief architect at Cloudera, who along with computer scientist Mike Cafarella drove the early development of Hadoop and MapReduce, commemorated the anniversary with this interesting essay. Other good articles have been written in recent months about the Hadoop ecosystem's illustrious history, including at Datanami and InformationWeek, so I will not recount it in this column.

The anniversary did make the mood at Strata + Hadoop a little more thoughtful and retrospective than usual, with Cutting and other speakers giving their historical perspectives on developments over the past decade. Our industry is usually so focused on the present and future that little time is taken to consider the past.

Alas, with my schedule packed full of meetings with vendors and presentations, I had little time to think about history. My focus was more on the state of the Hadoop ecosystem today and into the near future.

Technologies in the Hadoop ecosystem continue to develop and expand, often so rapidly that it's hard to get a stable view of its current state. You must always take into account new projects or changes to existing ones that can alter the picture. Yet, as more organizations base strategic, data-driven initiatives on the Hadoop ecosystem, demands for stability and maturity are growing.

Based on my discussions at the conference and related research, I believe that we will see progress in some key areas that will improve stability and flexibility. Here is a quick look at two major areas to watch.

1. Improving preparation, security, and management of data lakes

As organizations pour diverse data into Hadoop data lakes, "data swamp" has become the term many use to describe their true state. Sometimes a swamp is exactly what an organization's data scientists want; after all, the more positive ecological term for a swamp is a wetland, which can be teeming with life just as a data lake could be full of insights waiting to be discovered.

Nevertheless, most organizations would like to increase the maturity of their data lakes beyond the swamp stage toward more efficient, secure, well-governed, and valuable data resources. Repetitive, one-off data preparation routines and uncertain data quality waste time and resources and can introduce errors into analytic processes.

Because data lakes are not data warehouses, new management tooling is needed that can support multi-structured data, schema-on-read processing, and other differences. A major focus of many tools is improving the development and use of metadata, including through application of machine learning to discover what is in the data lake, validate and improve the data's quality, and manage the data's life cycle from ingestion to analysis. Machine learning is important if for no other reason than it is beyond humans to keep up with the massive volumes of data coming into lakes and manage them properly.
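To make the schema-on-read and metadata-discovery ideas concrete, here is a minimal Python sketch. It is illustrative only: the sample records, the `profile` and `read_with_schema` helpers, and the field names are all hypothetical, and simple type counting stands in for the machine learning that commercial tools would apply.

```python
import json
from collections import defaultdict

# Hypothetical raw records as they might land in a data lake:
# nothing enforced a schema when they were written.
RAW_LINES = [
    '{"id": 1, "amount": "19.99", "country": "US"}',
    '{"id": 2, "amount": 5, "tags": ["promo"]}',
    '{"id": "3", "country": "DE"}',
]

def profile(lines):
    """Build a tiny metadata catalog: field name -> {type name: count}."""
    catalog = defaultdict(lambda: defaultdict(int))
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            catalog[field][type(value).__name__] += 1
    return {field: dict(types) for field, types in catalog.items()}

def read_with_schema(lines, schema):
    """Apply a schema at read time: cast known fields, use None when absent."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }

# Discover what is actually in the "lake" before trusting it.
catalog = profile(RAW_LINES)

# Only now does the reader decide which fields and types it needs.
rows = list(read_with_schema(RAW_LINES, {"id": int, "amount": float}))
```

The catalog shows that `id` arrives as both integers and strings, which is the kind of quality issue a preparation tool would flag. The schema lives with the reader rather than the writer, so a different analysis could read the same raw lines with a different schema.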

Data preparation, data quality, and metadata catalog development will be covered in an upcoming TDWI Best Practices Report I am writing (many thanks to those who participated in our research survey).

The topic was a dominant one at Strata, where I met with several vendors focused on improving preparation, security, and management of Hadoop data lakes, including Podium Data, Talend, Trifacta, Trillium, Unifi, and Zaloni. Several of these (and other vendors) are competing to provide something close to a comprehensive, integrated management system to relieve IT organizations of having to knit a suite of tools together themselves.

Security has been a somewhat forgotten concern in the evolution of data lakes, but it cannot be anymore, especially in our age of rampant cybercrime. Procedures that may be more or less effective for traditional databases and data warehouses do not function as well in the Hadoop ecosystem.

I had a great if somewhat scary conversation with Reiner Kappenberger of Hewlett Packard Enterprise (HPE) Data Security Global Product Management about all the security gaps in most data lakes and the database and data access programs that work with them. HPE and some other innovators have offerings to improve security without locking down data so tight that analytics can't work with it. Security will be a critical challenge to address as organizations grow more reliant on Hadoop data lakes as part of their data architecture.

2. Streaming data and real-time analytics for operations

The other big topic at Strata and in the industry generally this year is streaming. Today, leading firms in industries such as financial services, healthcare, energy, telecom, manufacturing, and government are capturing insights from data and event streams and delivering real-time analytics for both human and automated decisions.

Rapid growth in the Internet of Things (IoT) -- the brave new world of sensors, smart devices, apps, and services -- is generating a fast-flowing stream of potentially valuable data. Real-time data and analytics are at the cutting edge of customer intelligence, process improvement, resource management, and the development of new products and services.

Some of the most interesting new technologies in the Hadoop ecosystem are focused on stream processing, "fast batch," and real-time analytics use cases that are not supported by Hadoop and MapReduce. Systems need to be able to know and react to every event in a stream in case it is significant or fits a predicted pattern.

Using analytics, organizations want to correlate across data streams and historical data sources to gain a complete view of customer activity, fraud and abuse behavior, cybercrime, and more. Apache Spark Streaming, Apex, Storm, and Kafka are garnering considerable attention, as are commercial technologies built on these open source projects.
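As a rough illustration of reacting to every event while correlating against historical data, the following Python sketch checks each incoming event against a per-customer baseline over a sliding time window. Everything here is an assumption made for illustration: the baseline table, the 60-second window, the default limit, and the `detect_bursts` function are hypothetical, and a production system would use an engine such as those named above rather than an in-process loop.

```python
from collections import defaultdict, deque

# Hypothetical historical profile: the most transactions per minute
# each customer has typically made in the past.
HISTORICAL_MAX_PER_MINUTE = {"alice": 2, "bob": 5}
DEFAULT_LIMIT = 3      # assumed limit for customers with no history
WINDOW_SECONDS = 60    # assumed sliding-window length

def detect_bursts(events):
    """Yield (timestamp, customer) when recent activity exceeds the baseline.

    `events` is an iterable of (timestamp_in_seconds, customer) pairs,
    assumed to arrive in timestamp order.
    """
    recent = defaultdict(deque)  # customer -> timestamps inside the window
    for ts, customer in events:
        window = recent[customer]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        limit = HISTORICAL_MAX_PER_MINUTE.get(customer, DEFAULT_LIMIT)
        if len(window) > limit:
            yield ts, customer

events = [(0, "alice"), (10, "alice"), (20, "bob"), (30, "alice"), (100, "alice")]
alerts = list(detect_bursts(events))
```

Each event is evaluated as it arrives rather than in a later batch, and the only state kept is the handful of timestamps still inside the window, which is what lets this style of processing keep pace with a continuous stream.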

I talked to in-memory technology providers such as VoltDB and longtime supercomputer technology provider Cray about how they deploy combinations of multi-platform, in-memory, and parallel processing technologies to pull in massive data streams for deep analytics.

As Spark Streaming and other current and coming technologies mature, we will see the Hadoop and MapReduce ecosystem shift increasingly away from its batch origins to support a variety of processing and analytics requirements and most particularly real-time stream processing.

Congratulations to all the innovators who brought us through the first 10 years of the Hadoop ecosystem -- and in an open source world, that's quite a lot of people! May the best be yet to come.

About the Author

David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years.
