Q&A: How Hadoop Blew Open the Door to Next Generation of Computing
A paper in the early 2000s about Google MapReduce helped to democratize distributed computing. A veteran of the industry discusses that time and the huge changes that continue to unfold today.
- By Linda L. Briggs
- June 30, 2015
The advent of the Internet of things calls for better ways to handle massive amounts of data. A Google paper about MapReduce in the early 2000s signaled a fundamental change in data architectures and distributed processing. In this interview, the second of two parts, Splice Machine CEO Monte Zweben, a long-time industry veteran, discusses how MapReduce, in his words, "broke open the big data world and democratized big data computing and distributed computing." [Editor's note: The first part of our interview can be found here.]
As we talk about the Internet of things, big data, and pending changes in how that data is managed and used, how important is the role of Hadoop?
Hadoop has changed everything. ... To give a little bit of history, I was sitting on an advisory board at Carnegie Mellon University back in the early 2000s. [We were looking at] a paper recently published by Google, the MapReduce paper. It was what preceded Hadoop. The paper was [about] a new computing paradigm -- the scale-out paradigm -- that Google was using. It really struck everyone in the computer science community in a very big way. ...
We've all basically struggled in computer science over the past 30 years to figure out a way to get many computers to work on a problem at once. People said if we could put a lot of computers together, we could solve much larger problems, but it turned out to be too hard. It turned out that you pretty much needed a Ph.D. in distributed systems in computer science in order to get computers to work together.
There were all these technical problems in getting computers to not "starve" each other -- meaning that one computer is waiting for something that the other one is producing, but that other computer is waiting for something that the first one is producing, and they lock up. There were all these technical problems like that in synchronizing machines to work together.
Then this paper came out and showed a way of avoiding all that and making it so that the average Java programmer could get hundreds or thousands of machines to work together.
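The model the paper introduced reduces the programmer's job to two functions, map and reduce, with the framework handling distribution, scheduling, and fault tolerance. A single-process Python sketch of the classic word-count example (the actual distribution across machines is elided; this only shows the shape of the programming model):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # emit one (word, 1) pair per word occurrence
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # combine all values for one key into a final result
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts: {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In the real system, the map calls run on the machines holding the input data and the shuffle moves intermediate pairs across the network, but the user writes only the two functions above.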
The open source community, realizing the importance of this, created Hadoop and replicated the Google infrastructure. They replicated the file system, the MapReduce computation engine, and a database called Bigtable [a distributed storage system for structured data], which became HBase.
That broke open the big data world and commoditized, or I should say democratized, big data computing and distributed computing. Now, suddenly, everyone could use hundreds or thousands of computers to attack problems.
You see that paper on MapReduce as a real turning point.
That is what I personally think broke it all open. It enabled programmers to take advantage of this massively disruptive architecture.
Now, what I see as the challenge today is this: IT is still in the dark, and it's in the dark because IT doesn't program computers; they use computers. They use platforms and architectures and databases. They're not in the business of developing software; they develop applications with software components.
I thought to myself (and my co-founders thought), how are we going to get the power of this distributed architecture -- which enables this huge distributed computing that will enable the Internet of things [and more] -- how do we get that out into the masses, into Fortune 500 and global 2000 companies and beyond?
Our view was that you had to deliver this power on something that everyone knew and understood, so what we're doing is bringing the power of Hadoop in the context of a relational database.
Everyone knows what a relational database is. The majority of applications built in the world are built on SQL in relational databases. What if the existing applications in the world, and the new ones that are going to be built, could be built in SQL but executed on Hadoop? How big could they be? How big could the datasets be, and how real-time could they be?
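The appeal of that approach is that plain, portable SQL stays the interface while the execution engine underneath changes. A small sketch using SQLite purely to illustrate the kind of standard SQL involved (the table and column names are invented for illustration; this is not Splice Machine's API, and a SQL-on-Hadoop engine would run the same statement across a cluster):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (device_id TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [("d1", 20.5), ("d1", 21.0), ("d2", 19.0)],
)

# Ordinary ANSI-style SQL: nothing here is specific to one engine.
rows = conn.execute(
    "SELECT device_id, AVG(reading) AS avg_reading "
    "FROM sensor_readings GROUP BY device_id ORDER BY device_id"
).fetchall()
# rows: [('d1', 20.75), ('d2', 19.0)]
```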
That is what I think is going to enable that second generation of applications that we talked about earlier.
We're trying to tackle that problem -- which is democratizing the power of Hadoop, not just for the programmers of the world but for IT, and we're doing it in the context of a relational database management system.
Is Splice Machine the only one doing this -- building a relational database management system that works with Hadoop?
I have both a yes and no answer to that. Lots of people -- lots -- have realized they need to make SQL work on Hadoop. They've recognized the power of Hadoop, they have moved their data onto the Hadoop file system, but they don't want their people to have to write MapReduce jobs in Java. They'd rather access that data using SQL, so everyone jumped on that bandwagon to support analytics on Hadoop with SQL. There are plenty of people doing that.
However, nobody has tried to address that second-generation challenge -- to build a database that actually provides the properties that traditional relational database management systems provide. There's a computer science term for that; it's called ACID properties. It's a technical term that stands for atomicity, consistency, isolation, and durability. Suffice it to say that what ACID does is to enable concurrency. It enables multiple readers and writers of the database to keep the database consistent. That's what the first generation of applications on client-server machines required, and that's what Oracle and MySQL and Postgres and IBM DB2 and Microsoft SQL Server all provide. They provide the ACID properties for concurrency.
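The "A" in ACID, atomicity, is what keeps a failure midway through a multi-step change from corrupting data. A small illustration using SQLite, which is ACID-compliant (the accounts table and the transfer scenario are invented for illustration; any ACID database behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # Step 1 of a two-step transfer: debit alice...
    conn.execute(
        "UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
    # ...but the process "fails" before bob is ever credited.
    raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    conn.rollback()  # atomicity: the debit is undone along with the rest

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances: {'alice': 100, 'bob': 50} -- no half-finished transfer survives
```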
Now, what happened with the Hadoop world is, no one did that. Splice Machine is the only company that we know of in the marketplace that is really pursuing that strategy: to truly power applications as you would with Oracle. Although there is a lot of effort to try to make Hadoop more accessible with SQL, we're uniquely differentiated in making Hadoop able to power real-time applications.