TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Understanding Hadoop: Foundations for Developing an Analytics Culture

What is core -- and important -- to understand about Hadoop and its adoption into the enterprise.

January 27, 2015

[Editor's note: Krish Krishnan is leading a session on understanding Hadoop at the TDWI Conference in Las Vegas (February 22-27, 2015), where he will discuss where and how Hadoop fits into your BI and analytics future. In this article he provides the fundamentals you must understand about Hadoop as an analytic foundation in your own enterprise.]

As 2014 was winding down and we were getting ready to predict the next generation of changes in technology and cultures for 2015 and beyond, there was excitement in the industry and the ecosystem about the trends of personal wearables, the Internet of Things, and artificial intelligence. Most important, there was a shift toward developing and adopting analytic cultures, and enterprises were beginning to realize how they needed to change.

One such change is to understand the technology and infrastructure layers of the analytic foundation; the leading solution is Hadoop. There are several important aspects of Hadoop that any enterprise needs to learn and understand. Let us look at a few.

Data Processing

How do we process data in a Hadoop landscape? Do we load data first and then process it, or do we stream data and process it during the file load? This confusion about design and process has caused many Hadoop programs to go into a tailspin. The confusion is not just about when to process but about which programs in the ecosystem to use to complete the process. Do we use Pig Latin, Informatica HParser, or a combination of MapReduce with Pig Latin or Talend? More important, how do we integrate data across different layers? Where is HCatalog used, and where in the process do we tag and classify data?

ETL, discovery, and analysis processing in Hadoop are more complex than they first appear. In the new world of infrastructure, the design strategy is to develop a library of programs, each with different technology and data processing requirements. Once we have the library, the YARN processing architecture makes it easy to "plug and play" the programs as needed in the architecture and thus provides a clear pathway to successful implementation.
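One way to picture such a library is a small registry that maps each processing requirement to a program. The sketch below is illustrative only: the `ProgramLibrary` class, the mode and task keys, and the two toy functions are hypothetical names, and nothing here calls YARN or any Hadoop API. It shows the "plug and play" idea in miniature, under the assumption that every program accepts and returns a list of text records.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type: a "program" takes records and returns processed records.
Program = Callable[[Iterable[str]], List[str]]


class ProgramLibrary:
    """Toy registry keyed by (processing mode, task). Not a YARN API."""

    def __init__(self) -> None:
        self._programs: Dict[Tuple[str, str], Program] = {}

    def register(self, mode: str, task: str, program: Program) -> None:
        # Plug a program into the library under its requirement key.
        self._programs[(mode, task)] = program

    def run(self, mode: str, task: str, records: Iterable[str]) -> List[str]:
        # Look up the program for this requirement and apply it.
        try:
            program = self._programs[(mode, task)]
        except KeyError:
            raise LookupError(f"no program registered for {mode}/{task}") from None
        return program(records)


def clean_batch(records: Iterable[str]) -> List[str]:
    # Batch-style cleansing: trim, lowercase, drop empty records.
    return [r.strip().lower() for r in records if r.strip()]


def tag_stream(records: Iterable[str]) -> List[str]:
    # Stream-style tagging: classify each record as it arrives.
    return [f"{'error' if 'error' in r.lower() else 'info'}:{r}" for r in records]


library = ProgramLibrary()
library.register("batch", "cleanse", clean_batch)
library.register("stream", "tag", tag_stream)
```

Calling `library.run("batch", "cleanse", records)` picks the registered program by requirement, so a different implementation can be swapped in without touching the callers.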

Which programs in the ecosystem are suited for what purpose? This is not easy to determine, but combinations such as Pig + YARN, MapReduce + YARN, and HParser + Spark are all well suited for both batch and stream processing of different components of the data landscape. Today, the success of Hadoop within any enterprise and its scalability both depend on understanding these foundations and discussing them from a data perspective.

Hadoop Distribution

Do I use MapR or Cloudera or Hortonworks? My CIO is an IBM or Oracle or Teradata fan, so how do we build a proof of concept (POC) or business case? Do all distributions need the same license from Apache?

These questions are often discussed on panels and forums, but the end results are rarely useful because they cause more confusion than resolution. To understand the landscape of solutions and the ecosystem from an open source software perspective, you need to understand the different components and discover the solution stack that is specific to each vendor.

For example, if you compare Hortonworks and Cloudera, you will find that Hortonworks is Hive-specific, whereas Cloudera has developed Impala and is pushing the platform to solve analytic requirements. You need to compare vendors in order to understand the ecosystem before you develop use cases and look at executing a proof of concept.

Another example is Amazon EMR versus SQL on Hadoop from Actian: two solutions that both attempt to answer the scalability question but take very different approaches in architecture and design. Cloud versus onsite is another challenge from a vendor perspective, where you need to compare, for example, Treasure Data with Teradata from a feature perspective while taking overall cost and ROA into account.

Confusing? Yes, but there are many discussions and opinions available that will help you get a clear understanding so you can best choose the solution features you need and understand the success factors for evaluating vendors and selecting one for your POC (and eventual implementation).

Another set of perspectives is available from McKinsey, Booz Allen, Forrester, and Gartner, which will help you understand the vendors' financial situations and their long-term market strategies.

One final caution: be ready to create a heterogeneous architecture.

Hadoop Security

Security is always a hot topic, especially when it factors into enterprise adoption of Hadoop. What needs to happen in the ecosystem, where are we now, and what are our future plans?

In retrospect, when we started Hadoop in early 2009, the team I worked with wanted to create a crawler that would scan the Internet quickly and collect data that could be used to perform searches and provide highly accurate results. In that respect, our software had to process data ingestion and discovery at high speeds; the rest of the analysis and processing could happen in micro-batch environments. We were not worried at that time about information security, but the foundational security was designed to support Kerberos and LDAP-driven architectures.

Between 2005 and 2010, with Hadoop as our platform, we kept the basic architecture available for security. In 2013, with SQL on Hadoop and Impala, Spark in-memory on Hadoop, and stream processing becoming more prevalent, the security requirements definitely increased. Today, we see many projects in Hortonworks and Cloudera focusing on data and user security. The future of security in Hadoop is moving in the right direction. Among the projects:

Apache Knox Gateway: Secure REST API gateway for Hadoop with authorization and authentication
Apache Sentry: Hadoop data and metadata authorization
Apache Ranger: Similar to Apache Sentry
Apache Accumulo: NoSQL key/value store with cell-level authorization
Project Rhino: A general initiative to improve Hadoop security by contributing code to the entire stack
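To make "cell-level authorization" concrete, here is a minimal sketch of the idea in plain Python. It is a simplification and does not use the Accumulo API: the `Cell` class, the `visible_cells` function, and the sample labels are all hypothetical. Accumulo's actual visibility expressions allow boolean combinations with `&` and `|`, whereas this sketch assumes every label on a cell is required.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: FrozenSet[str]  # simplification: ALL labels are required


def visible_cells(cells: Iterable[Cell], authorizations: Iterable[str]) -> List[Cell]:
    # A cell is readable only if the reader holds every label the cell requires.
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


# Hypothetical sample data: one row whose cells carry different labels.
CELLS = [
    Cell("emp1", "name", "Ana", frozenset()),
    Cell("emp1", "salary", "90000", frozenset({"hr"})),
    Cell("emp1", "ssn", "000-00-0000", frozenset({"hr", "pii"})),
]

# A reader holding only the "hr" authorization sees name and salary, not ssn.
hr_view = [c.column for c in visible_cells(CELLS, {"hr"})]
```

The point of the design is that the filter is applied per cell rather than per table or per row, so two readers can query the same row and receive different columns.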

The pace of evolution is fast, and the results will emerge across multiple Hadoop release packages in 2015 and 2016.

[Editor's note: This article has been updated to correct a date error on page 3.]

Krish Krishnan is an industry thought leader and practitioner in data warehousing. His expertise spans all areas of business intelligence and data warehousing. Krish specializes in providing high-performance solutions for small and large BI/DW initiatives. You can contact the author at [email protected].
