A Framework for Understanding the Big Data Revolution
This Information Life Cycle Framework helps you navigate the big data technology landscape and see big data in light of a complete, end-to-end solution that can tackle the entire information life cycle.
By Mehul Shah and Sachin Sinha
We are in the midst of a big data revolution that holds significant potential to disrupt most industries and sectors. The 3Vs of big data (volume, velocity, and variety) have been touted many times at conferences and in journals, blogs, and periodicals. A few early adopters, mostly in the Internet industry, have successfully leveraged big data for tangible top- and bottom-line growth. Apart from these elite few, most organizations are still in their big data infancy. Multiple challenges must be addressed before big data analytics goes mainstream.
The big data boom has spawned a variety of technology companies eager to help realize the dream. Incumbents have bolstered their product portfolios and revamped their marketing machines around big data. Quite a few "disruptors" have also emerged, built on open source frameworks and/or spun off from big Internet companies. The current buzz about big data has led to a plethora of vendors, each claiming that its technology is the best at lowering the total cost of ownership (TCO).
With all these tools at your disposal, you might think it is easy to get started with your first project. The reality, however, is that most people still find it difficult to start a big data analytics project. Impediments and challenges include:
- Lack of a single vendor technology stack that handles big data from cradle to grave
- Disparate tools and a lack of integration among tools from different vendors
- Differing approaches and complex architectural choices
- Complicated licensing in an open source/subscription model resulting in big costs (or at least the fear of them)
- Lack of the right skills and clarity about the responsibility within the team or organization
- An inability to identify a use case that can provide a quick "win" to gain broader executive buy-in
- Lack of a robust data governance and quality framework to handle both traditional and big data
Information Life Cycle Framework for the Big Data Industry Landscape
To provide clarity and direction about some of these challenges, we have created an Information Life Cycle Framework for the big data landscape. We'll first define the framework, then discuss how it addresses some of the challenges we've described.
The Information Life Cycle Framework comprises two information life cycle categories: Incumbents (traditional players) and Disruptors (newcomers). We further subdivide each category into three traditional information life cycle stages: data acquisition (ETL, ELT, etc.), data storage and warehousing (DBMSes: columnar, row, MPP, data warehouse optimized, hybrid, Hadoop-based), and data analysis (business intelligence and advanced analytics).
| | Data Acquisition | Data Storage and Warehousing | Data Analysis |
|---|---|---|---|
| Incumbents | Informatica; Abinitio; IBM - Infosphere; Microsoft; Oracle; SAP; SAS - Dataflux | Teradata - AsterData; IBM - Netezza; Oracle Exalytics; SAP - HANA, Sybase IQ; EMC - GreenPlum; Microsoft; HP - Vertica | Microstrategy; IBM - Cognos, SPSS; SAP - BOBJ; Oracle - BI; SAS; Microsoft |
| Disruptors | Talend; Pentaho; Cloudera (Sqoop, Flume); Scribe (developed at Facebook); Syncsort | Cloudera; Hortonworks; MapR; Amazon (Elastic MapReduce); MongoDB; Cassandra; ParAccel | DataMeer; Splunk; Revolution R; Karmasphere; Alpine Miner; Automated Insights; Tableau; Jaspersoft; Tibco Spotfire; Clarabridge |
The "Incumbents" comprise the old guard of the data industry. This includes pure-play vendors such as Informatica, MicroStrategy, and Teradata, which operate in one segment, as well as the Big 4 (IBM, Oracle, SAP, and Microsoft), which tend to operate in all three segments and claim to provide a complete stack. Vendors in the "Incumbents" category have provided data-handling solutions for some time, focused primarily on two of the three big data Vs: volume and velocity. These vendors are now extending their support to handle the third V, variety. Most Incumbents have products rooted in the RDBMS arena; with MPP, columnar storage, and compression, they tend to support structured analytics and reporting on large volumes of data.
The "Disruptors" include up-and-coming vendors and startups built on a new paradigm for handling big data. The majority of these vendors base their products on the Hadoop/MapReduce ecosystem, which is evolving rapidly and emerging as a primary platform for big data. The Disruptors category also encompasses pure-play NoSQL database vendors such as MongoDB and Cassandra, which tend to fill the void in handling variety and large volumes. These newer tools support both variety and volume but are still evolving on the velocity front, particularly for real-time analytics.
Benefits of the Information Life Cycle Framework
The Information Life Cycle Framework highlights that none of the up-and-coming vendors provides an end-to-end solution for handling and leveraging big data. You need to ensure that you pick a set of vendors that together cover the entire information life cycle.
For example, by procuring Cloudera's Hadoop distribution, you take care of big data storage and processing needs, but you still need analytics/BI running on top to deliver the real business benefit. The vendors you select should also be able to interoperate. For example, an analytics solution you procure from Revolution R should integrate well with, and harness the full power of, a Hadoop-based data storage appliance.
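As a rough illustration of the coverage check described above, the following Python sketch encodes a small subset of the framework table and reports which life cycle stages a shortlist of vendors leaves uncovered. The names `Stage`, `VENDOR_STAGES`, and `missing_stages` are hypothetical and exist only for this example; the vendor-to-stage assignments are copied from the framework table, and the sketch says nothing about whether the selected products actually interoperate.

```python
from enum import Enum


class Stage(Enum):
    """The three information life cycle stages used in the framework."""
    ACQUISITION = "data acquisition"
    STORAGE = "data storage and warehousing"
    ANALYSIS = "data analysis"


# Hypothetical lookup built from a subset of the framework table:
# each vendor maps to the stage columns in which it is listed.
VENDOR_STAGES = {
    "Talend": {Stage.ACQUISITION},
    "Cloudera": {Stage.ACQUISITION, Stage.STORAGE},
    "MapR": {Stage.STORAGE},
    "Revolution R": {Stage.ANALYSIS},
    "Tableau": {Stage.ANALYSIS},
}


def missing_stages(selected_vendors):
    """Return the stages (in framework order) not covered by the selection."""
    covered = set()
    for vendor in selected_vendors:
        # Vendors absent from the lookup contribute no coverage.
        covered |= VENDOR_STAGES.get(vendor, set())
    return [stage for stage in Stage if stage not in covered]


if __name__ == "__main__":
    for shortlist in (["Cloudera"], ["Cloudera", "Revolution R"]):
        gaps = [stage.value for stage in missing_stages(shortlist)]
        print(shortlist, "-> missing:", gaps or "none")
```

A shortlist containing only Cloudera leaves the data analysis stage open, which mirrors the point above that a Hadoop distribution still needs an analytics/BI layer on top.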
You might already have made significant investments in technologies from incumbents. For example, you might be an SAP or an Oracle shop and want to build on top of these existing investments. As depicted in the Framework, incumbents have extended their product lines to support big data. It is smart, then, to investigate your current vendors before procuring new disruptive technology. For example, if you currently leverage an Oracle stack, you might want to consider Oracle Exadata and Exalytics as you expand your big data program.
As we've explained, the Incumbents and Disruptors take different architectural approaches to handling big data. You might want to dive deeper into your unique business needs to decide which one fits better. Not all three Vs of big data may be equally important in your environment. A majority of enterprises in non-Internet-based industries still collect large volumes of transaction data that can come from both internal and external sources. These companies should look at tuning their existing architectures to handle and analyze large volumes of fairly structured data. They might have only a small percentage of their data in multi-structured format (Twitter feeds and Facebook interaction data, for example). That data can be handled in the cloud via MicroStrategy or Clarabridge, which perform the textual analytics and return structured data to be combined with the rest of the in-house data for holistic analysis. If your needs are greater, you can consider SPSS, SAS, or R.
Summary
The Information Life Cycle Framework helps you navigate the big data technology landscape and see big data in light of a complete, end-to-end solution that can tackle the entire information life cycle. It helps you compare apples to apples (comparing MapR with Cloudera, for example), not apples to oranges (comparing Cloudera with SAS).
To get started with big data, you need to evaluate your current architecture, identify the gaps, and put together a future target state. This framework can aid you with this exercise.
Mehul Shah is a senior manager focusing on information management and data governance for a top 10 financial services company. He is an accomplished IT manager with over 12 years of experience in information management and in managing large, complex programs and projects related to enterprise-wide business intelligence and data warehouse implementations, architecting and building dashboards and BI applications, and working with cross-functional teams. Mehul has an MBA in marketing and analytics and an MS in computer science from the University of Maryland and is also a PMP-certified practitioner. You can contact the author at [email protected].
Sachin Sinha is director of business intelligence and analytics at ThrivOn, where he is responsible for designing innovative architectures, developing methodologies, and delivering solutions in analytics, business intelligence, and data warehousing that help clients realize maximum value from their data assets. For over a decade, Mr. Sinha has designed, architected, and delivered data integration, data warehousing, analytics, and business intelligence solutions. Specializing in information management, Mr. Sinha's domestic and international consulting portfolio includes organizations in the financial services, insurance, health-care, pharmaceutical, and energy industries. You can contact the author at [email protected].