The Death of Hadoop?
Is Hadoop dead? Not so fast. Plan on supporting multiple environments for some time to come.
By Barry Devlin
February 26, 2019
The recent "merger of equals" between Cloudera and Hortonworks has triggered speculation about the possible imminent demise of Hadoop. Market observers question whether the merger indicates a shrinking Hadoop ecosystem market that can no longer support its two largest competing beasts.
Immediately after the close of the deal on January 3, Silicon Valley Business Journal reported that the combined Cloudera and Hortonworks was valued at less than $3 billion, down from a joint value of about $5.2 billion when they announced the deal in October 2018. Investors, it appears, were less than impressed.
The overarching consideration is the cloud. The analytics world is looking skyward for mostly financial reasons. Storage, processing, and systems maintenance costs for analytics are considerably lower in the cloud than on premises. The economic arguments that favored Hadoop-based data lakes over relational data warehouses for the past decade are now turned against Hadoop. Furthermore, usage peaks are handled on demand in the cloud, rather than requiring reserve capacity on site that sits idle most of the year. Anaconda's Mathew Lodge highlighted these issues when the merger was first announced. His conclusion was "that after a good 10 years of Cloudera and Hortonworks being the center of the Big Data universe, the center of gravity has moved elsewhere."
I suspect that Lodge's center-of-gravity conclusion may be a little simplistic. Both data gravity and investment gravity need to be considered.
Hadoop Lives On ... For Now
Data gravity speaks to where the source data for analytics originates. Lifting large volumes of internally sourced data to the cloud requires energy, both physical and financial. If the data to be modeled consists of ten years' worth of point-of-sale data stored in your data center, and you must also upload significant volumes of sales data every day, you face a trade-off between the cost of transferring the data and the lower storage and processing costs in the cloud. Latency must be considered as well. Privacy, security, and data sovereignty considerations may also come into play.
Investment gravity refers to the sunk cost of existing environments in terms of both physical assets and skills acquisition. With up to ten years of investment in Hadoop, putting the elephant out to pasture may be less attractive. Businesses beginning to reap the rewards of a more-challenging-than-expected implementation may hesitate to jump just yet. Those with successful implementations will also think twice, just as those with effective data warehouses were (and still are) slow to migrate.
So-called legacy technology is hard to kill. The mainframe still lives and thrives, having adapted to the evolving environment. So, too, will Hadoop adapt and live with the cloud.
The Immediate Challenge
Many business intelligence (BI) and analytics departments face a short-term challenge. Their data warehouses may be deemed legacy systems but still play a key role in reconciling data and delivering standard reporting and basic BI. Hadoop-based data lakes provide the foundation for analytics. Cloud solutions for BI and analytics are maturing and offer advantages in elasticity and cost. For the immediate future, it seems likely that already stretched IT departments will need to support all three environments.
Over the past year, I have written about the need to combine existing data warehouses and lakes in a production analytics platform. The underlying driver of this approach is a recognition of the differing core strengths of the two environments -- consistency of the warehouse, agility of the lake. A production analytics platform does not eliminate the effort of maintaining two environments. Its value comes from reducing -- or, ideally, removing -- the data management complexity that comes with two separate environments.
Extending the platform concept to include cloud makes sense. The core strength of cloud is its proximity to the major sources of data today and into the future: social media and the Internet of Things. Positioning cloud as master for such data follows sound data management and architecture principles. It avoids the false and often unachievable goal of moving all internally sourced data to the cloud or the equally improbable idea of bringing all externally sourced data into the data lake. In physical terms, the foundation is already appearing in the form of hybrid cloud infrastructures.
Outlook: Cloudy Skies and Thriving Elephants
My prognosis is that Hadoop is here for the long haul. Cloud is undoubtedly set to grow. We may see a decrease in new project starts in the Hadoop environment as improving cloud solutions are favored for new projects and for some data warehouse upgrades that might have previously gone to Hadoop. The consolidation of Hortonworks' and Cloudera's Hadoop distros will decrease systems management concerns for data lake implementers and may well encourage existing, ongoing Hadoop projects to persevere to achieve return on investments already made.
The bottom line: plan to support all three environments for the foreseeable future.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.