From Hollywood to Hadoop
I transcended time and space earlier this week when I attended Hadoop World in New York City.
It started Monday evening. After taking a high-speed train from Boston, I emerged from the bowels of Penn Station onto the bright lights and bustling streets of mid-town Manhattan. The pavement was wet from a passing rain and lightening pulsed in the distant sky, framed by the city’s cavernous skyscrapers. I felt like I had entered a Hollywood set for an apocalyptic movie. But that was just the beginning.
Invigorated by the city’s pulsing energy, I decided to walk 15 blocks to my hotel. Halfway there, the winds picked up, the muted lightening roared to life, and rain scoured the streets in endless waves. I ducked under a large hotel canopy just in time to see hail the size of shooter marbles pelt everything in sight. After 15 minutes, the deluge subsided. But by the time I reached my hotel, I was soggy and stunned.
The next morning, as I listened to the proceedings from Hadoop World, I realized that the prior night’s surreal weather was a fitting prelude to the conference—at least for me. Hadoop World was a confab for programmers—almost 1,000 of them. As a data guy, it felt like I had been transported to parallel universe where the people looked and acted the same but spoke a completely different language. But what I did understand, I liked.
With Hadoop, it seems that the application community finally discovered data and its potential to make businesses smarter. “Hadoop is a high value analytics engine for today’s businesses,” said Mike Olson, during his kickoff keynote. Mike is CEO and Founder of Cloudera, an open source provider of Hadoop software and services and host of the event. Following Olson on the stage was Tim O’Reilly, founder of O’Reilly Media, a long-time high-tech luminary and open source proponent. He said, "We are the beginning of an amazing world of data-driven applications. It's up to us to shape the world."
It was wonderful to see the developer community discover data in all its glory. To my fellow developers, I say, “Welcome aboard!” We’re all on the same page now.
Hadoop is one of the first attempts by the developer community to get their arms around data in a way that conforms to their skills, knowledge, and culture. From a data guy’s perspective, Hadoop is clunky, slow, and woefully immature. But it does have advantages. As a result, it’s already popping up in corporate data environments as a complement to analytical databases. For example, some leading-edge companies are using Hadoop to process and store large volumes of clickstream and sensor data that they then feed into analytical databases for query processing.
So what is Hadoop? It might be easier to say what it's not.
· Hadoop is not a database; it’s a distributed file system (Hadoop Distributed File System or HDFS) that scales linearly across commodity servers. It is also a programming model (MapReduce) that enables developers to build applications in virtually any language they want and run them in parallel across large clusters.
· Hadoop is not a transactional system; it’s a batch-oriented system that runs hand-crafted Map-Reduce programs. You are not going to run iterative queries in Hadoop.
· Hadoop does not support random data access; it reads and writes all data sequentially, which makes it tortuously slow for tactical updates and queries and mixed workload applications.
Today, Hadoop shines as an infinitely scalable data processing environment for handling huge volumes of data that would be prohibitively expensive to store and analyze in a traditional relational database or even a data warehousing appliance. Hadoop lets companies capture and store all their data—structured, semi-structured, and unstructured—without having to archive or summarize the data. Consequently, some companies, such as Comscore and CBS Interactive, use Hadoop as a massive staging area to capture, store, and prepare large volumes of data for delivery to downstream analytic structures.
The main advantages of Hadoop are:
1. Open Source. The software is free. And free is good compared to spending millions of dollars on a relational database to handle tens of terabytes to petabytes of data (if it can.) You can download individual components from the Apache Software Foundation, or purchase a “distribution” from third party providers, such as Cloudera or IBM. A distribution is a package of Hadoop-related applications that are tested to ensure compatibility and stability and delivered with support and professional services on a subscription basis.
2. Linear Scalability. Hadoop is an MPP system that runs on commodity servers. It scales linearly as you add more servers. It has minimal overhead compared to relational databases so it offers superior scalability.
3. Streaming. Hadoop is a file system that does not require specialized schema or normalization to capture and store data or a special language to access it. Therefore, Hadoop makes it possible to perform (high-speed) reads and writes. In addition, a new application called Flume lets Hadoop consume streaming event data. In other words, it’s easy to get large volumes of data in and out of Hadoop.
4. Unstructured data. Because of its schema-less design, Hadoop and MapReduce work well on any type of data. MapReduce interprets data at run time based on the keys and values defined in the MapReduce program. Thus, a developer can design the program to work against structured, semi-structured, or even unstructured data, such as images or text.
5. Minimal Administration. Hadoop automatically handles node failures, making it easy to administer large clusters of machines and write parallelized programs that run against the cluster.
The Future of Hadoop
We are in the early days of Hadoop. There is a tremendous amount of excitement and energy around the initiative. The open source community is innovating quickly and bringing to market new capabilities that make Hadoop more database-like and a better partner in corporate data centers. For example, the community has introduced Hive, a SQL-like language that generates MapReduce programs under the covers and makes Hadoop appear more like a relational engine. It has also released Pig, a dataflow language that makes it easier to create MapReduce transformation logic than writing low-level Java.
Conversely, some BI vendors are adopting elements of Hadoop. For example, database vendors, such as Aster Data and Greenplum, have added support for MapReduce. And many relational database and ETL vendors, such as Pentaho and Talend, have implemented or announced bidirectional interfaces for moving data in and out of Hadoop. In addition, BI vendors, led by DataMeer, are working on JDBC interfaces to Hadoop so users can execute reports and queries against Hadoop from the confines of their favorite BI tool. Expect a slew of announcements this year from the likes of MicroStrategy, SAP BusinessObjects, IBM Cognos, and others supporting Hadoop.
It's clear that we’ve entered the era of big data analytics. And frameworks, such as Hadoop, are helping to advance our ability to generate valuable insights from large volumes of data and new data types. Just as exciting, the developer and data communities are converging to address large-scale data issues. And while our language and approaches may differ, it won’t be long before we all sing the same tune with the same words.
Posted on October 15, 2010