The Hadoop Ecosystem: Celebrating 10 Years, Tackling New Challenges
Even after 10 years, Hadoop technologies are still expanding, and I believe that we will soon see progress in two key areas that will improve stability and flexibility.
- By David Stodder
- May 2, 2016
'Tis the season of trade shows and conferences, from TDWI's own educational events to user conferences to the recent Strata + Hadoop World event in San Jose. My bags seem perpetually packed, the laundry always needs doing, and I even caught myself wearing a conference badge around my neck at home the other night. My wife said I needed it.
This year is the 10th anniversary of Hadoop, which has grown up into a broad ecosystem of technologies based on foundational Apache open source projects. Hadoop and MapReduce are certainly among the most important and influential technology developments of our generation.
Doug Cutting, now chief architect at Cloudera, who along with computer scientist Mike Cafarella drove the early development of Hadoop and MapReduce, commemorated the anniversary with this interesting essay. Other good articles have been written in recent months about the Hadoop ecosystem's illustrious history, including at Datanami and InformationWeek, so I will not recount it in this column.
The anniversary did make the mood at Strata + Hadoop a little more thoughtful and retrospective than usual, with Cutting and other speakers giving their historical perspectives on developments over the past decade. Our industry is usually so focused on the present and future that little time is taken to consider the past.
Alas, with my schedule packed full of meetings with vendors and presentations, I had little time to think about history. My focus was more on the state of the Hadoop ecosystem today and into the near future.
Technologies in the Hadoop ecosystem continue to develop and expand, often so rapidly that it's hard to get a stable view of its current state. You must always take into account new projects or changes to existing ones that can alter the picture. Yet, as more organizations base strategic, data-driven initiatives on the Hadoop ecosystem, demands for stability and maturity are growing.
Based on my discussions at the conference and related research, I believe that we will see progress in some key areas that will improve stability and flexibility. Here is a quick look at two major areas to watch.
1. Improving preparation, security, and management of data lakes
As organizations pour diverse data into Hadoop data lakes, many have begun using the term "data swamp" to describe the lakes' true state. Sometimes a swamp is exactly what an organization's data scientists want; after all, the more positive ecological term for a swamp is a wetland, which can be teeming with life just as a data lake could be full of insights waiting to be discovered.
Nevertheless, most organizations would like to increase the maturity of their data lakes beyond the swamp stage toward more efficient, secure, well-governed, and valuable data resources. Repetitive, one-off data preparation routines and uncertain data quality waste time and resources and can introduce errors into analytic processes.
Because data lakes are not data warehouses, new management tooling is needed that can support multi-structured data, schema-on-read processing, and other differences. A major focus of many tools is improving the development and use of metadata, including through application of machine learning to discover what is in the data lake, validate and improve the data's quality, and manage the data's life cycle from ingestion to analysis. Machine learning is important if for no other reason than it is beyond humans to keep up with the massive volumes of data coming into lakes and manage them properly.
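To make schema-on-read and basic metadata profiling concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's method: the sample records, the field names, and the helper functions `read_with_schema` and `profile` are all hypothetical. The idea is that raw records land in the lake with no enforced schema, and a schema is applied only when the data is read; a simple completeness score per field then serves as a first piece of data quality metadata.

```python
import json

# Hypothetical raw records as they might land in a data lake. No schema was
# enforced at write time, so fields and types vary from record to record.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2016-05-02"}',
    '{"user": "b2", "amount": 5}',
    '{"user": "c3", "note": "no amount recorded"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only the fields named in `schema` and coerces each value with the
    given callable. Missing or uncoercible values become None.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (KeyError, TypeError, ValueError):
                row[field] = None
        rows.append(row)
    return rows


def profile(rows):
    """Return the fraction of non-null values per field (simple quality metadata)."""
    fields = rows[0].keys() if rows else []
    return {f: sum(r[f] is not None for r in rows) / len(rows) for f in fields}


# The same raw data could be read with a different schema by another consumer.
rows = read_with_schema(RAW_LINES, {"user": str, "amount": float})
completeness = profile(rows)
```

In practice the tools discussed here aim to automate this kind of discovery and validation at a scale where hand-written checks like these would not keep up.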
Data preparation, data quality, and metadata catalog development will be covered in an upcoming TDWI Best Practices Report I am writing (many thanks to those who participated in our research survey).
The topic was a dominant one at Strata, where I met with several vendors focused on improving preparation, security, and management of Hadoop data lakes, including Podium Data, Talend, Trifacta, Trillium, Unifi, and Zaloni. Several of these (and other vendors) are competing to provide something close to a comprehensive, integrated management system to relieve IT organizations of having to knit a suite of tools together themselves.
Security has been a somewhat forgotten concern in the evolution of data lakes, but it can no longer be ignored, especially in our age of rampant cybercrime. Procedures that may be more or less effective for traditional databases and data warehouses do not function as well in the Hadoop ecosystem.
I had a great if somewhat scary conversation with Reiner Kappenberger of Hewlett Packard Enterprise (HPE) Data Security Global Product Management about all the security gaps in most data lakes and the database and data access programs that work with them. HPE and some other innovators have offerings to improve security without locking down data so tightly that analytics can't work with it. Security will be a critical challenge to address as organizations grow more reliant on Hadoop data lakes as part of their data architecture.
2. Streaming data and real-time analytics for operations
The other big topic at Strata and in the industry generally this year is streaming. Today, leading firms in industries such as financial services, healthcare, energy, telecom, manufacturing, and government are capturing insights from data and event streams and delivering real-time analytics for both human and automated decisions.
Rapid growth in the Internet of Things (IoT) -- the brave new world of sensors, smart devices, apps, and services -- is generating a fast-flowing stream of potentially valuable data. Real-time data and analytics are at the cutting edge of customer intelligence, process improvement, resource management, and the development of new products and services.
Some of the most interesting new technologies in the Hadoop ecosystem are focused on stream processing, "fast batch," and real-time analytics use cases that the batch-oriented combination of Hadoop and MapReduce does not support. Systems need to be able to know and react to every event in a stream in case it is significant or fits a predicted pattern.
Using analytics, organizations want to correlate across data streams and historical data sources to gain a complete view of customer activity, fraud and abuse behavior, cybercrime, and more. Apache Spark Streaming, Apex, Storm, and Kafka are garnering considerable attention, as are commercial technologies built on these open source projects.
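As a rough illustration of reacting to every event and correlating a stream with historical data, the following Python sketch checks each incoming event against a sliding time window and a historical average. It is a toy, single-process example, not how Spark Streaming, Storm, Apex, or Kafka work internally. The event layout, the `HISTORICAL_AVG` lookup table, the thresholds, and the `detect` function are all assumptions made for the example.

```python
from collections import defaultdict, deque

# Hypothetical historical data: typical purchase amount per customer.
HISTORICAL_AVG = {"a1": 20.0, "b2": 300.0}

WINDOW_SECONDS = 60  # assumed sliding-window length
MAX_EVENTS = 3       # assumed: more events than this in one window is a burst
AMOUNT_FACTOR = 5.0  # assumed: flag amounts above 5x the historical average


def detect(events):
    """Yield (event, reason) for each event that looks significant.

    `events` is an iterable of (timestamp_seconds, customer_id, amount)
    tuples, assumed to arrive in time order.
    """
    recent = defaultdict(deque)  # customer_id -> timestamps inside the window
    for ts, customer, amount in events:
        window = recent[customer]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        # Pattern 1: too many events from one customer in a short time.
        if len(window) > MAX_EVENTS:
            yield (ts, customer, amount), "burst"
        # Pattern 2: amount far above this customer's historical average.
        avg = HISTORICAL_AVG.get(customer)
        if avg is not None and amount > AMOUNT_FACTOR * avg:
            yield (ts, customer, amount), "unusual amount"


events = [
    (0, "a1", 10), (10, "a1", 15), (20, "a1", 12),
    (30, "a1", 150), (200, "a1", 10), (205, "b2", 100),
]
alerts = list(detect(events))
```

Because `detect` is a generator, each alert is produced as soon as the triggering event is seen, rather than after a batch completes. Production stream processors add what this sketch leaves out, such as distribution across machines, fault tolerance, and handling of out-of-order events.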
I talked to in-memory technology providers such as VoltDB and longtime supercomputer technology provider Cray about how they deploy combinations of multi-platform, in-memory, and parallel processing technologies to pull in massive data streams for deep analytics.
As Spark Streaming and other current and coming technologies mature, we will see the Hadoop and MapReduce ecosystem shift increasingly away from its batch origins to support a variety of processing and analytics requirements and most particularly real-time stream processing.
Congratulations to all the innovators who brought us through the first 10 years of the Hadoop ecosystem -- and in an open source world, that's quite a lot of people! May the best be yet to come.
About the Author
David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years.