Q&A: Big Data Meets Hadoop
Big data and Hadoop are popular tech terms, but what does the relationship of these two technologies mean for BI professionals?
- By James E. Powell
- June 26, 2012
Big data and Hadoop are popular tech terms, but what does their relationship mean for BI professionals? For answers, we turned to Paul Flach, vice president, enterprise analytics delivery at Stream Integration, who is leading two sessions on big data and Hadoop at the TDWI World Conference in San Diego (July 29-August 3, 2012).
Is Hadoop going to drastically change or even replace my data warehouse environment?
Hadoop is going to revolutionize the way you do analytics and your ability to deal with enormous volumes of data that previously were not accessible. However, it is not going to drastically change existing data warehouse (DW) architectures and certainly won't replace them.
Remember, RDBMSes have been evolving for more than four decades and have become highly sophisticated. Hadoop, although it is not pure brute force, does not yet offer the same sophistication. Hadoop should be seen as a means to complement the DW architecture by processing the data flows and analytics that are beyond DBMS capabilities in terms of variety, volume, velocity, and even veracity.
What are the main differences between relational database systems and MapReduce?
Both have their strengths and weaknesses. RDBMSes mostly depend on structured data with a known schema. MapReduce works best with unstructured data, although it can also handle structured data. In addition, MapReduce is read-oriented, with no update capability in its current form. RDBMSes are strong in transactional processing, whereas MapReduce is batch-oriented.
One other difference to consider is compression. Hadoop is very limited in its support for compression techniques, and when you are throwing around petabytes of data, you need compression.
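The batch-oriented, read-only processing style that distinguishes MapReduce from an RDBMS can be sketched in a few lines of plain Python. This is an illustrative toy only (no Hadoop involved): a word count, the canonical MapReduce example, showing the map, shuffle, and reduce phases as ordinary functions.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: read each input record and emit (key, value) pairs --
    # here, (word, 1) for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group all values by key; Reduce: aggregate each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

# A tiny batch of unstructured log lines -- the whole batch is read,
# processed, and written out; nothing is updated in place.
logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(map_phase(logs))
print(counts["error"])  # 2
```

Note how the framework never modifies existing records; it only reads input and writes aggregated output, which is exactly why MapReduce suits batch analytics rather than transactional workloads.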
What are the 3 Vs of big data and how are they used to determine the right solution for my architecture?
The 3 Vs are volume, velocity, and variety.
We all know that data volume is growing geometrically, so we know what volume is. Velocity considers those time-sensitive processes such as fraud detection, where data is streaming in at a rapid rate and needs to be monitored in-stream in order to maximize its value. Finally, variety means that we are dealing with any type of data, from structured to unstructured such as text, sensor data, audio, click stream, log files, and others.
If you are dealing with data volume on its own, you do not necessarily have a “big data” problem. An MPP shared-nothing platform or appliance provides robust capability, and many provide the near-linear scalability required to handle today’s data volumes.
If you have both data volume and velocity, there are technologies, such as InfoSphere Streams from IBM, that are in-stream analytics platforms providing real-time distributed processing. The benefits of Hadoop are fully realized when you have all three Vs, in which case you have a “big data” problem. Again, one of the major benefits of Hadoop that distinguishes it from RDBMSes is that it solves the problem of data variety.
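The velocity dimension — monitoring data in-stream rather than after it lands in a warehouse — can be illustrated with a toy sketch. This is not InfoSphere Streams or any real streaming engine; it is a minimal, hypothetical stand-in for fraud-style detection over a sliding window of recent events.

```python
from collections import deque

def monitor(events, window=5, threshold=3):
    """Flag a key when it occurs `threshold` times within the last
    `window` events -- a toy stand-in for in-stream fraud detection,
    acting on each event as it arrives instead of after batch load."""
    recent = deque(maxlen=window)  # sliding window of recent events
    alerts = []
    for event in events:
        recent.append(event)
        if recent.count(event) >= threshold:
            alerts.append(event)
    return alerts

# Card "c42" transacts three times in quick succession -> flagged in-stream.
print(monitor(["c1", "c42", "c42", "c7", "c42"]))  # ['c42']
```

The point of the sketch is the architecture, not the detection rule: the decision is made while the data is still moving, which is what distinguishes velocity-driven platforms from batch-oriented Hadoop jobs.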
What is the best way to introduce Hadoop into my data warehouse architecture to get started?
A good introduction to Hadoop is to look at it as an extract, transform, and load (ETL) technology to complement your existing environment as a means to process Web logs, social data, text data, or machine-generated data. Remember, it will not replace your ETL architecture as a means to stream data directly into your reliable relational structures.
The outputs of a pure Apache Hadoop implementation will be stored in HBase, a column-oriented database. From there, those outputs can be further processed and stored in your SQL-based data warehouse.
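The transform-then-load pattern described above can be sketched in miniature. The sketch below makes simplifying assumptions: the log lines and their format are invented, the aggregation step stands in for what a Hadoop job would do at scale, and an in-memory sqlite3 database stands in for the SQL-based data warehouse (Hadoop and HBase themselves are omitted).

```python
import sqlite3
from collections import Counter

# Hypothetical raw web-log lines in the form: "timestamp page status".
raw_logs = [
    "2012-06-26T10:00 /home 200",
    "2012-06-26T10:01 /cart 500",
    "2012-06-26T10:02 /home 200",
]

# "Transform": the kind of aggregation a Hadoop ETL job would perform
# over terabytes of logs -- here, hit counts per page.
hits = Counter(line.split()[1] for line in raw_logs)

# "Load": write the summarized output into a relational structure
# (sqlite3 stands in for the real SQL-based data warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER)")
db.executemany("INSERT INTO page_hits VALUES (?, ?)", hits.items())
row = db.execute("SELECT hits FROM page_hits WHERE page = '/home'").fetchone()
print(row[0])  # 2
```

Only the small, aggregated result reaches the warehouse; the bulky raw logs stay in the Hadoop tier, which is the division of labor the answer above recommends.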
What are the dynamics at play that determine the effectiveness of my big data architecture?
When you are designing a big data system, you need to look at performance, fault-tolerance, and a flexible query interface.
Performance is perhaps the most obvious characteristic when designing a system and correlates directly with cost savings, especially where you can avoid expensive hardware upgrades. The movement in Hadoop implementations is toward low-cost or commodity hardware. The trade-off to this approach, however, is a higher potential for failure compared to high-end, reliable hardware and DBMS systems with built-in fault tolerance. As you scale out commodity hardware to meet your data volume, you increase the probability of failure.
From a user-interface point of view, you have to remember all the money you have invested in the SQL-based BI technologies that your organization is accustomed to. SQL is a standard that has given business analysts easy access to data through ODBC and JDBC connectivity, without having to deal with your database software directly. Your architecture must continue to be friendly to your analytical community that will continue to communicate through an SQL interface.
What is the biggest obstacle preventing the adoption of big data technologies in today's enterprise?
The biggest obstacle is the same as it has always been for BI in general: organizations are not developing analytical skills at the same rate that technology is developing. Big data technology has been most successfully implemented where the determination to produce sophisticated analytics has been driven by a highly skilled analytical community.
Google had an analytic problem to solve and they refused to be constrained by the 3 Vs, so they implemented MapReduce to solve that problem. If your organization is slow to adopt your self-service BI tool, you certainly don’t want to present Pig and Hive as a more user-friendly alternative.
What is the best investment you can make to develop this analytical culture?
Before you make an investment in any more technology, send your analysts to a community college-level statistics course. If we are going to turn the corner and enter this new era of “data science,” analytics must become as common as any other comprehension skill for the entire organization and can no longer be relegated to the scientists.