Hadoop Reconsidered: A Data Management Perspective
After half a decade, Hadoop remains a divisive technology. Some data managers think Hadoop has the potential to be hugely transformative; others see it as a more pedestrian proposition.
- By Stephen Swoyer
- March 19, 2013
[Editor's note: A response to this article from a quoted source is posted on the last page.]
Hadoop started out as a mostly developer-driven effort. From the perspective of many in data management (DM), it's still mostly developer-driven.
This is why Hadoop remains a divisive technology for many data managers. Yes, there is a divide: developers don't understand DM and its set-based orientation; DM practitioners don't understand enterprise software development, which tends to emphasize what might be called a "procedural" world view.
There's likewise resentment on both sides. To hear some in business intelligence (BI) tell it, software developers are tone-deaf to DM: they lack an understanding of DM concepts and methods, and if or when their projects require the functionality of an RDBMS, they tend to prefer expediency or thrift at the expense of robustness and scalability.
"I've seen it so many times I can play my half of the conversation with a tape recorder," says industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc. Prior to founding the consultancy, Madsen logged time in executive positions with several firms -- including Harry and David.
The conversation goes like this, says Madsen: "IT will come to me and say 'We used MySQL for a data warehouse, then it got big so we had to do it over with vertical partitioning, then we had to do it over with that and a centralized aggregate store, then we realized databases [are problematic] and used Hadoop.'"
He describes it as a kind of pathology. Even so, he stresses that there's merit in Hadoop and big data: as a scalable data processing platform, he notes, Hadoop is without equal.
Madsen is likewise a big proponent of open source software (OSS); he nevertheless thinks enterprise IT is pathologically less alert than it could or should be to the merits of traditional DM platforms. He's also concerned about the potential for what might be called big data "malpractice," particularly with all of the hype attending Hadoop.
"I see this over and over again: people in IT [are] so ignorant of things that have been around for decades that they reinvent them in expensive and misguided ways. If you argue [the point], they say 'But we can't afford Oracle!' and if you say 'Not Oracle, [we can use] a parallel database for that purpose,' they say 'But this [i.e., the technology in use] was open source and it's free.' Then you tally the team of six-figure salaries [that's] needed to build and maintain the collection of open source projects that [they're using to maintain] their database – [a database] that easily fits in a node or two of Teradata, or even in a single Sybase IQ server."
Manan Goel, senior director of product marketing with Teradata Aster, concurs -- to an extent. He calls Hadoop "a good conversation starter," inasmuch as it, or the hype attached to it, helps to inform prospective customers about both problems and potential solutions.
"Once you get into the details of the use cases and how customers want to use Hadoop ... what we're seeing more and more is that they're looking to market-ready technologies, out-of-the-box solutions -- like Teradata Aster -- to solve the big data analytic use cases," Goel told BI This Week at TDWI's recent World Conference in Las Vegas.
"They start out wanting to talk [Hadoop], but once they find out what's involved, what they want is Hadoop ... just without a lot of work, effort, or Java expertise. They're much more receptive to other [solutions]."
Believe the Hype?
That said, many data managers believe Hadoop truly is hype-worthy.
Last year, for example, Dave Inbar, senior director for big data products with data integration (DI) specialist Pervasive Software Inc., famously described Hadoop as "a beautiful platform for all kinds of computation." Hadoop, Inbar continued, elegantly addresses several long-standing problems -- including "the data distribution [problem], the coarse-grained parallelism problem, and distribution of computation problem."
Inbar isn't alone in his enthusiasm for Hadoop.
Take Scott Davis, who's perhaps best known as the co-founder and CEO of Lyzasoft Inc., a BI player that specializes in collaborative discovery. So far this year, however, Davis has spent a lot of time talking about Hadoop.
The inescapable fact, he argues, is that Hadoop is hugely transformative; certainly, it's over-hyped -- what about big data isn't? -- but it's shockingly substantive, too.
Hadoop likewise flouts -- or simply ignores -- one of the Iron Laws of computer science, says Davis: what he calls "delamination" -- i.e., the principle that the layers of a platform architecture must be separable; that -- in Hadoop's case -- the compute and storage layers must be conceptually and technologically discrete. "Hadoop says: 'I'm going to make the compute and storage layers bound to one another in a way that does not allow them to be delaminated; I'm going to do that intentionally, because I think that in doing so, I can get some amazing performance benefits without [having to have] a tightly scoped computational space.'"
For certain kinds of applications, Davis argues, this is categorically the case: "For what [its creators] wanted to do for high-complexity, high-scale, processing-intensive tasks, it rocks, but you have to live inside that sort of functional space."
Before he founded Lyzasoft, Davis helped found Eyeris, a provider of hosted profitability and analysis services for transportation firms and telcos.
The hosted Eyeris solution uses a MapReduce-like computing algorithm that originally ran over a distributed cluster or grid of systems, with a storage area network (SAN) from EMC Corp. serving as a distributed storage layer. This was circa 1999.
Fast forward almost 15 years, says Davis, and you have an open source software (OSS) solution -- i.e., Hadoop -- that not only performs a similar function but that implements a distributed file system -- which can run across a heterogeneous mix of hardware -- in place of a physical storage layer that's tied to a specific manufacturer and product. You have, in effect, a scalable data processing platform that will run on almost anything.
In porting its service to Hadoop, Davis says, Eyeris saw "a two-order-of-magnitude step-function increase" in price-performance. That's big, he says:
"The basic economics you need to understand about Hadoop ... [are that] as long as your scale curve has some slope, ... it doesn't actually cost you any more money to go faster. If the machines are priced by the hour, which they are; if the process runs two-times as fast on 1,000 nodes as it does on 500, which it does, it's free. It's the only place in technology that [this kind of performance] can be free."
Davis concedes that Madsen and other technologists have valid criticisms of Hadoop and big data. He nevertheless argues that -- for those applications or use cases in which the economic case for Hadoop is simply insuperable -- such criticisms are inapposite.
He cites three such use cases: big data analytics, data archiving, and supercharged ETL. This last is a big category, Davis maintains.
"If you ask any data scientist, they will tell you that somewhere between 80 to 90 percent [of the time] they spend on any analytic project is conforming data to be fit for the [types of] analytics they want to run: getting the data ready," he says. "This is really just another term for the same sorts of things that happen in ETL; maybe you're not using an ETL tool, but you're reshaping the data, and as a pure ETL process, Hadoop will smoke anything. It has shocking, just staggering advantages."
Like ETL, Hadoop until now has been a batch-only proposition. Vendors such as Cloudera Inc., EMC Corp., Hortonworks Inc., and MapR Inc. have separately announced technologies that aim to compress or condense this batch interval. The recasting of Hadoop as an interactive data processing platform is the New Frontier, Davis contends.
"If we figure out a few things -- like how mere mortals can interact with the system without having to write the most advanced Java code ever known to man, or how we can harness all of this compute power for something that is not inherently batch -- Hadoop could be just an insanely transformative technology," he concludes.
Third Nature's Madsen is sympathetic to Davis' arguments, as well as to those of other "pragmatic" Hadoop boosters. In the near term, however, he thinks the number of genuinely Hadoop-ready applications will be comparatively limited.
"The trick [with Hadoop] is that you have to be able to recognize when your problem is appropriate to this environment and when it isn't. If it isn't, you can run into the situation where you use 100-times the resource of a database, and the cost and complexity outweigh what you can do with the tools already available to you," he points out. "Hadoop is good for stepwise batch execution of explicitly parallel problems. It turns out [parallelism of this kind is] a 20-percent[-of-the-time] kind of problem."
- - -
Updated 3/25/13: Reply from Scott Davis, who was quoted in this article
I enjoyed our interview that played a small role in your recent article on Big Data and DM. If you would indulge me further, I would like to clarify one thing that I think did not come out quite straight in your notes from our conversation.
Specifically, I do not believe Hadoop is or ever will be the appropriate tool for interactive analysis. There's a high set-up and overhead cost of every job in Hadoop, which is then followed by very low incremental cost of repeating that job across more data. To put it another way, Hadoop does not scale down very well.
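The point about scaling down can be sketched with a simple model: a fixed set-up cost per job plus a per-gigabyte cost that is spread across the nodes. The numbers here are hypothetical and chosen only to show the shape of the curve, not measured from any cluster.

```python
# Hypothetical model: every job pays a fixed set-up cost, then a small
# incremental cost per gigabyte that is divided across the nodes.

def job_seconds(gigabytes, nodes, setup_seconds=30.0, seconds_per_gb=2.0):
    """Estimated wall-clock time for one batch job."""
    return setup_seconds + seconds_per_gb * gigabytes / nodes

def overhead_share(gigabytes, nodes, setup_seconds=30.0, seconds_per_gb=2.0):
    """Fraction of the job's time spent on fixed set-up rather than data."""
    total = job_seconds(gigabytes, nodes, setup_seconds, seconds_per_gb)
    return setup_seconds / total

small_share = overhead_share(gigabytes=1, nodes=10)        # interactive-sized query
large_share = overhead_share(gigabytes=100_000, nodes=10)  # large batch job

print(f"small job: {small_share:.1%} of time is set-up")
print(f"large job: {large_share:.1%} of time is set-up")
```

In this model the small job spends nearly all of its time on set-up, while for the large job the set-up cost all but disappears, which is the asymmetry described above.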
Interactive analyses (such as visualization) are usually running on very bounded or scaled-down subsets of data, for which the SQL/RDBMS approach has the huge advantage of being able to leverage caching and pre-aggregations. I think we can make Hadoop more usable by providing interface layers that do not require coding; however, the way Hadoop executes commands and moves data will limit its application to high-scale, complex batch processes -- not interactive, responsive analytics. We can make those batch experiences more convenient, but we cannot make them "right now" responsive.