Are Hadoop's Best Days Behind It -- Or Still Ahead?
Pessimists are predicting the end of Hadoop -- "peak Hadoop," in the words of one influential analyst. Optimists say Hadoop's future is assured. Who's right?
In a recent 2017 prospectus, Philip Russom, senior director of research for data management with TDWI, highlighted increased enterprise uptake of Hadoop as a major trend in 2016 and beyond.
"TDWI has seen a giant step forward in adoption starting in late 2015 and continuing into 2016. The survey from TDWI Best Practices Report: Data Warehouse Modernization shows that 17 percent of data warehouse programs surveyed already have Hadoop in production in their data warehouse environment. This is up from earlier surveys, which showed 10 to 12 percent," he wrote.
Will Cloud Lead to Hadoop's Decline?
Russom's optimism is something of an outlier. For example, the 2016 edition of Gartner's Hype Cycle for Information Infrastructure has Hadoop "sliding" into the "Trough of Disillusionment." At this year's Pacific Northwest BI Summit, Gartner analyst Merv Adrian was only slightly more generous. Hadoop, Adrian said, "has probably moved up the slope [of the Hype Cycle curve] a little bit. We think we are basically on the upslope, [which means we're] moving out of the Trough."
In a widely read article published in late October, Ovum analyst Tony Baer coined the expression "peak Hadoop." The greatest danger to Hadoop isn't Spark -- which is a compute engine, not a data management platform -- but cloud, Baer argued. It's easy for subscribers to spin up Spark instances in the cloud as needed -- much easier than budgeting for hardware and deploying Spark or Hadoop on premises.
If you combine Spark's compute power with inexpensive cloud storage -- such as Amazon's Simple Storage Service (S3), Google Cloud Storage, or Microsoft Azure Storage -- there's no need for Hadoop, right?
So who's right? Optimists such as Russom or pessimists -- let's call them "contrarians" -- such as Gartner and Baer? Isn't it possible that there's cause for both optimism and pessimism?
Production Use Exposing Shortcomings
According to Adrian, one reason Hadoop is mired in Gartner's Trough of Disillusionment is that companies are now using it for real production workloads. The transition from test or prototype to production system has exposed some of Hadoop's shortcomings.
There's "real use by real [companies] in sizable numbers that have moved beyond experimentation and pilot [projects and are] actually putting things in production," Adrian told attendees. "We're ... starting to get ... negative feedback because now people are expecting these things to be usable."
Baer stresses that Hadoop offers data management amenities that are missing from Spark. He sees the data lake as a redoubt in which Hadoop can hold out against both Spark and cloud storage.
Hadoop Growing in the Cloud
Even in the cloud, Hadoop is still used extensively, if unobtrusively. Take Amazon's new Athena SQL query service for its Simple Storage Service (S3). S3 is a storage-only service. It doesn't have a baked-in compute engine. Amazon Athena uses the open source Presto SQL interpreter, which requires a separate compute engine. Amazon's solution is to use a Hadoop instance running in the context of its Elastic MapReduce (EMR) service to power Athena's Presto-based SQL query facility.
Anecdotally, data management vendors say they're seeing strong interest in Hadoop and, yes, Spark. "I would definitely say that we're finally seeing a lot of Hadoop projects ... really kicking into gear now," says Chris Jordan, president of business intelligence and analytics specialist iOLAP. "We're having big data, Hadoop-type discussions with many if not most of our customers now. It's them bringing it up. It's not us coming to them telling them they need to do this."
Don Mettica, iOLAP's senior vice president of analytics, says customers are specifically interested in spinning up Hadoop instances in the cloud -- in part because of the cost and complexity of standing up Hadoop in on-premises environments. "We were pitching an on-premises [Hadoop] platform to a very immature client because we thought that's what they wanted, but they said, 'No, we want this to be in the cloud from the get-go.' That was a little surprising to us," he says.
Lovan Chetty, director of product management with big-data-as-a-service specialist Cazena, says there's plenty of demand for Hadoop in the cloud. Cazena has two big-data-as-a-service offerings: a data mart service, based on massively parallel processing database platforms, and a data lake service. Cazena's data lake can be hosted in either Amazon's EMR or Microsoft's Azure HDInsight service.
The catch, Chetty concedes, is that basically all of Cazena's data lake customers are also running Spark in the context of Hadoop. This makes sense because Spark isn't a database and doesn't have a persistence layer. Amazon's S3, as mentioned, doesn't have a compute layer, but it is a (relatively) cheap source of storage for a data lake service. Hadoop running in the context of EMR can pull data from S3 -- or persist it locally if necessary -- and also play host to the Spark compute engine.
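The division of labor Chetty describes can be illustrated with a minimal PySpark sketch. It assumes a Spark session running on a Hadoop cluster such as EMR, where the s3:// scheme is typically available; outside EMR, the s3a:// connector is the more usual choice. The bucket name, prefix, Parquet layout, and event_type column are hypothetical and are not drawn from Cazena's service or any customer's setup.

```python
def s3_uri(bucket: str, prefix: str) -> str:
    """Build an s3:// URI for a dataset prefix (bucket and prefix are hypothetical)."""
    return f"s3://{bucket.strip('/')}/{prefix.strip('/')}/"


def count_events_by_type(bucket: str, prefix: str):
    """Read Parquet files from S3 and aggregate them with Spark.

    S3 only stores the files; the cluster that hosts Spark does the computing.
    """
    # Imported here so the URI helper above works even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

    # Assumption: the prefix holds Parquet files with an "event_type" column.
    events = spark.read.parquet(s3_uri(bucket, prefix))

    # The aggregation runs on the cluster; nothing is computed inside S3.
    return events.groupBy("event_type").count()


if __name__ == "__main__":
    # Only the URI helper is exercised here, so this runs without a cluster.
    print(s3_uri("example-data-lake", "/events/2016/"))
```

Calling count_events_by_type("example-data-lake", "events/2016") on a cluster would return a Spark DataFrame of counts per event type. The storage stays in S3 throughout, and only the compute lives on the cluster.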
"The analysts and data scientists we're working with, when they say 'Hadoop,' they actually mean Spark [running in the context of Hadoop] 90 percent or more of the time. We don't actually have anyone that we're working with who's using [Hadoop's] MapReduce [compute engine]," he points out.
With So Many Uses, Hadoop's Here to Stay
These anecdotes are consistent with Russom's conclusion "that Hadoop is making steady progress as a platform well suited to many purposes in data warehousing and analytics." They're also consistent with TDWI survey data projecting that the percentage of organizations (36 percent) that plan to integrate Hadoop with a data warehouse will more than double between now and 2020.
TDWI's survey might reflect an on-premises bias, but Hadoop's future in the cloud seems no less compelling. From a data management perspective, then, it's hard to argue with Russom's conclusion that "Hadoop is here to stay and will soon become common in data warehouse programs."