
Understanding Hadoop: Foundations for Developing an Analytics Culture

What is core -- and important -- to understand about Hadoop and its adoption into the enterprise.

[Editor's note: Krish Krishnan is leading a session on understanding Hadoop at the TDWI Conference in Las Vegas (February 22-27, 2015), where he will discuss where and how Hadoop fits into your BI and analytics future. In this article he provides the fundamentals you must understand about Hadoop as an analytic foundation in your own enterprise.]

As 2014 was winding down and we were getting ready to predict the next generation of changes in technology and culture for 2015 and beyond, there was excitement in the industry and the ecosystem about the trends of personal wearables, the Internet of Things, and artificial intelligence. Most important, there was a shift toward developing and adopting an analytics culture, and enterprises were beginning to realize how they needed to change.

One such change is to understand the technology and infrastructure layers of the analytics foundation; the leading solution is Hadoop. There are several important aspects of Hadoop that any enterprise needs to learn and understand. Let us look at a few.

Data Processing

How do we process data in a Hadoop landscape? Do we load data first and then process it, or do we stream data and process it during the file load? This confusion about design and process has caused many Hadoop programs to go into a tailspin. The confusion is not just about when to process but about which programs in the ecosystem to use to complete the processing. Do we use Pig Latin, Informatica HParser, or a combination of MapReduce with Pig Latin or Talend? More important, how do we integrate data across the different layers? Where is HCatalog used, and where in the workflow do we tag and classify data?
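
To make the "load first, then process" pattern concrete, here is a minimal sketch of a batch word count written for Hadoop Streaming in Python. It assumes the data has already landed in HDFS; the input and output paths, the script name, and the streaming jar location are placeholders, not a prescription for any particular distribution.

  #!/usr/bin/env python
  # Minimal Hadoop Streaming word count: the data already sits in HDFS and is
  # processed in batch after the load. Launch with something like:
  #   hadoop jar hadoop-streaming.jar -files wordcount.py \
  #       -input /data/raw/logs -output /data/out/wordcount \
  #       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
  import sys

  def mapper():
      """Emit (word, 1) pairs, tab-separated, one per line."""
      for line in sys.stdin:
          for word in line.strip().split():
              print("%s\t1" % word)

  def reducer():
      """Sum the counts for each word; streaming delivers keys in sorted order."""
      current, total = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t", 1)
          if word != current:
              if current is not None:
                  print("%s\t%d" % (current, total))
              current, total = word, 0
          total += int(count)
      if current is not None:
          print("%s\t%d" % (current, total))

  if __name__ == "__main__":
      mapper() if sys.argv[1] == "map" else reducer()

The same logic could be written in a few lines of Pig Latin or generated by a tool such as Talend; the point is that the choice of processing engine is separable from the decision of when to process.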

ETL, discovery, and analysis processing in Hadoop are more complex than meets the eye. In this new world of infrastructure, the design strategy is to develop a library of programs covering the different technology and data processing requirements. Once we have that library, the YARN processing architecture makes it easy to "plug and play" programs as needed, which provides a clear pathway to a successful implementation.
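
As a rough illustration of that "library of programs" idea, the sketch below registers jobs built with different engines behind one simple dispatcher. The launcher commands (pig, hadoop jar, spark-submit) are the standard CLIs, but the job names, script files, jar, and class name are hypothetical placeholders.

  import subprocess

  # A minimal "library of programs": each entry maps a logical job name to the
  # command that launches it on the cluster. YARN schedules all of them, so the
  # orchestration layer does not care which engine does the work.
  # The scripts, jar, and class below are placeholders.
  JOB_LIBRARY = {
      "parse_clickstream": ["pig", "-f", "parse_clickstream.pig"],
      "sessionize": ["hadoop", "jar", "sessionize-mr.jar", "com.example.Sessionize"],
      "score_segments": ["spark-submit", "--master", "yarn", "score_segments.py"],
  }

  def run_job(name):
      """Launch a registered job and fail loudly if it is unknown or errors out."""
      if name not in JOB_LIBRARY:
          raise KeyError("unknown job: %s" % name)
      subprocess.check_call(JOB_LIBRARY[name])

  if __name__ == "__main__":
      # A tiny pipeline: any step can later be swapped to a different engine
      # without touching the orchestration code.
      for step in ("parse_clickstream", "sessionize", "score_segments"):
          run_job(step)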

Which programs in the ecosystem are suited for which purpose? This is not easy to determine, but combinations such as Pig + YARN, MapReduce + YARN, and HParser + Spark are all well suited for batch and stream processing of different components of the data landscape. Today, the success of Hadoop within any enterprise, and its scalability, both depend on understanding these foundations and discussing them from a data perspective.
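
The sketch below, assuming a Spark-on-YARN deployment (submitted with something like spark-submit --master yarn) and purely illustrative HDFS paths, shows how one engine can cover both sides: a batch pass over files already in HDFS and a streaming pass over a directory as new files arrive.

  # Batch and stream processing with one engine on a YARN cluster (a sketch).
  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="batch-and-stream-sketch")

  # Batch: process data that has already been loaded into HDFS.
  batch_counts = (sc.textFile("hdfs:///data/raw/logs")
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
  batch_counts.saveAsTextFile("hdfs:///data/out/batch_counts")

  # Stream: process new files as they land, in 30-second micro-batches.
  ssc = StreamingContext(sc, 30)
  stream_counts = (ssc.textFileStream("hdfs:///data/incoming")
                      .flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))
  stream_counts.pprint()

  ssc.start()
  ssc.awaitTermination()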

Hadoop Distribution

Do I use MapR or Cloudera or Hortonworks? My CIO is an IBM or Oracle or Teradata fan, so how do we build a proof of concept (POC) or business case? Do all distributions need the same license from Apache?

These questions are often discussed on panels and in forums, but the results are rarely useful because the discussions tend to cause more confusion than resolution. To understand the landscape of solutions and the ecosystem from an open source software perspective, you need to understand the different components and discover the solution stack that is specific to each vendor.

For example, if you compare Hortonworks and Cloudera, you will find that Hortonworks is Hive-centric, whereas Cloudera has developed Impala and is pushing that platform to address analytic requirements. You need to compare vendors in order to understand the ecosystem before you develop use cases and look at executing a proof of concept.
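
One practical way to see the difference during a POC is to run the same SQL through each engine's client interface. Here is a minimal sketch using the community PyHive and impyla packages; the hostnames, ports, and table name are placeholders for whatever your own cluster exposes.

  # Same SQL, two engines: Hive (batch-oriented) versus Impala (MPP, low latency).
  # Hostnames, ports, and the table are placeholders.
  from pyhive import hive            # pip install pyhive
  from impala.dbapi import connect   # pip install impyla

  QUERY = "SELECT region, COUNT(*) FROM web_events GROUP BY region"

  # HiveServer2, typically listening on port 10000.
  hive_cursor = hive.connect(host="hive-gateway.example.com", port=10000).cursor()
  hive_cursor.execute(QUERY)
  print("Hive:", hive_cursor.fetchall())

  # Impala daemon, typically listening on port 21050.
  impala_cursor = connect(host="impala-node.example.com", port=21050).cursor()
  impala_cursor.execute(QUERY)
  print("Impala:", impala_cursor.fetchall())

Both clients follow the Python DB-API, so the comparison stays symmetric and you can focus on response time and concurrency behavior rather than plumbing.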

Another example is Amazon EMR versus SQL on Hadoop from Actian: two solutions attempting to answer the same scalability question but doing so very differently from an architecture and solution perspective. Cloud versus on-site deployment is another challenge from a vendor perspective, where you need to weigh, say, Treasure Data against Teradata from a feature perspective while taking overall cost and ROI into account.

Confusing? Yes, but many discussions and opinions are available that will help you get a clear understanding so you can choose the solution features you need and understand the success factors for evaluating vendors and selecting one for your POC (and eventual implementation).

Another set of perspectives is available from McKinsey, Booz Allen, Forrester, and Gartner; these will help you understand the vendors' financial situations and their long-term market strategies.

One final caution: be ready to create a heterogeneous architecture.

Hadoop Security

Security is always a hot topic, especially when it factors into enterprise adoption of Hadoop. What needs to happen in the ecosystem, where are we now, and what are our future plans?

In retrospect, when we started with Hadoop in early 2009, the team I worked with wanted to create a crawler that would scan the Internet quickly and collect data that could be used to perform searches and provide highly accurate results. In that respect, our software had to handle data ingestion and discovery at high speeds; the rest of the analysis and processing could happen in micro-batch environments. We were not worried at that time about information security; the foundational security that was designed supported Kerberos- and LDAP-driven architectures.

Between 2005 and 2010, with Hadoop as our platform, we kept that basic security architecture in place. By 2013, with SQL on Hadoop and Impala, Spark in-memory processing on Hadoop, and stream processing becoming more prevalent, the security requirements definitely increased. Today, we see many projects in the Hortonworks and Cloudera distributions focusing on data and user security. The future of security in Hadoop is moving in the right direction. Among the projects:

  • Apache Knox Gateway: Secure REST API gateway for Hadoop with authorization and authentication (see the sketch after this list)
  • Apache Sentry: Hadoop data and metadata authorization
  • Apache Ranger: Centralized security administration and authorization for Hadoop, similar in scope to Apache Sentry
  • Apache Accumulo: NoSQL key/value store with cell-level authorization
  • Project Rhino: A general initiative to improve Hadoop security by contributing code to the entire stack
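
To make the gateway model concrete, here is a minimal sketch of listing an HDFS directory through Apache Knox using the WebHDFS REST API; the gateway host, topology name ("default"), credentials, and path are placeholders for your own deployment.

  # Listing an HDFS directory through the Apache Knox gateway (WebHDFS REST API).
  # Knox authenticates the caller (for example, against LDAP) and proxies the
  # request into the cluster, so clients never touch the cluster nodes directly.
  # Host, topology, credentials, and path below are placeholders.
  import requests

  GATEWAY = "https://knox-gateway.example.com:8443/gateway/default"

  response = requests.get(
      GATEWAY + "/webhdfs/v1/data/raw/logs",
      params={"op": "LISTSTATUS"},
      auth=("analyst", "analyst-password"),  # HTTP Basic auth, typically backed by LDAP
      verify=False,                          # POC clusters often use self-signed certificates
  )
  response.raise_for_status()

  for entry in response.json()["FileStatuses"]["FileStatus"]:
      print(entry["type"], entry["pathSuffix"])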

The pace of evolution is fast, and the results will emerge across multiple Hadoop releases in 2015 and 2016.


Krish Krishnan is an industry thought leader and practitioner in data warehousing. His expertise spans all areas of business intelligence and data warehousing. Krish specializes in providing high-performance solutions for small and large BI/DW initiatives. You can contact the author at [email protected].
