Getting the Most from Hadoop in the Cloud
Keep these five considerations at the forefront of your planning and you'll be able to eke out every ounce of performance possible from your Hadoop cluster in the cloud.
- By Sachin Sinha
- July 7, 2016
Two of the biggest trends in technology right now are cloud computing and Hadoop. With the growing popularity of cloud computing, enterprises are seriously looking at moving any and all workloads to the cloud. At the same time, enterprises have started to evaluate Hadoop -- a technology that makes processing and analysis of modern-day big data workloads possible. As they weigh both, one question comes up frequently: "Can we run Hadoop in the cloud?"
There are three flavors of Hadoop in the cloud.
Hadoop as a service in the public cloud: Amazon's EMR (Elastic MapReduce) provides a quick and easy way to run MapReduce jobs without having to install a Hadoop cluster of your own. This can be a good way to develop Hadoop programming expertise internally within your organization, or a good fit if you only need to run MapReduce jobs for your workloads. MapR also provides an option that can be easily deployed using Amazon's EMR to deliver strong features such as instant recovery and continuous low latency.
Prebuilt Hadoop in the public cloud marketplace: Hadoop distributions (MapR, Cloudera CDH, IBM BigInsights, Hortonworks HDP) can be launched and run on public clouds such as AWS, Rackspace, Microsoft Azure, Google Compute Engine, OpenStack, etc. AWS and MapR together provide a converged platform that can be easily deployed with a choice of several editions. A big advantage of this strategy is that you can quickly stand up not just Hadoop infrastructure but also integrate it with other technologies (such as Amazon Redshift) to build a complete enterprise data hub.
A similar offering on Microsoft Azure allows customers to seamlessly transfer data between MapR and Microsoft SQL services within Azure, and also provides Microsoft Power BI access, via SQL, to data sources in the MapR cluster.
Build your own Hadoop in the public cloud: You can use infrastructure-as-a-service (IaaS) for this. In a public cloud, you are sharing the infrastructure with other customers. As a result, you have very limited control over which server the virtual machine (VM) is spun up on and what other VMs (yours or other customers') are running on the same physical server.
There is no "rack awareness" for you to access and configure in the NameNode. The performance and availability of the cluster may suffer because you are running on VMs. Enterprises can use and pay for these Hadoop clusters on demand.
There are options for creating your own private network using VLAN, but for the best Hadoop cluster performance, we recommend you have a separate, isolated network because of high network traffic between nodes. With this option, you roll up your sleeves and install and configure the Hadoop cluster in the cloud on your own.
Why Hadoop Belongs in the Data Center
Even with all these options, the Hadoop and cloud mega trends may not be ready to be integrated just yet. There are three main reasons why Hadoop belongs in an enterprise data center for now rather than in a cloud computing environment:
1. Heavy and increasing workloads favor on-premises Hadoop. Hadoop clusters tend to be heavily utilized, with capacity being added as resources get scarce rather than being massively over-provisioned. In other words, whether slow and steady or fast and steady, Hadoop clusters get fed data in a mostly predictable fashion, without the peaks and valleys that normally lend themselves to an elastic cloud deployment.
2. Cloud storage is both slower and more expensive for data sets that just keep growing. Cloud storage may have unacceptably long access times, and cost comparisons don't indicate it's inherently cheaper anyway. In addition, Hadoop tends to collect 10 times or more data than legacy transactional environments do; data scientists and their customer-focused business stakeholders will almost never want to discard Hadoop data; and the access requirements are unpredictable -- all of which favors on-premises storage.
3. Data sources and locality make a big difference for performance. Although running Hadoop clusters in the cloud may make sense when the data itself is generated in the cloud (e.g., analysis of Twitter), for real-time, customer-facing systems with data coming from multiple venues, your operations department will likely need to build Hadoop out in a physical facility with the right (deterministic bandwidth and latency) network interconnects to minimize the end-to-end latency of the application.
Getting the Greatest Performance
If, however, you do decide to play with fire and want to install Hadoop in a cloud IaaS, then take note of these best practices.
It's important to carefully pick compute capacity. Not all compute nodes are equal: providers offer a buffet of options, some heavy on processors while others give you more RAM. Because Hadoop is a compute-intensive platform and its clusters tend to be heavily utilized, pick compute nodes with beefier processors and higher amounts of RAM.
As a general rule of thumb, allowing 1-2 containers per disk and per core gives the best balance for cluster utilization. The same goes for the vcore-to-RAM ratio: under YARN resource management, a poor ratio heavily penalizes system utilization. Another general rule of thumb: allocate 2-4 GB of RAM per core at a minimum.
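The rules of thumb above reduce to simple arithmetic. Here's a minimal sketch; the node specs and the 8 GB OS/daemon reservation are illustrative assumptions, not recommendations:

```python
# Rough YARN container sizing from the rules of thumb above.
# Node specs and the reserved_gb figure are purely illustrative.
def size_containers(cores, ram_gb, disks, reserved_gb=8):
    # 1-2 containers per disk and per core; take the smaller bound
    # so neither disks nor cores are oversubscribed.
    containers = min(2 * cores, 2 * disks)
    # Leave some RAM for the OS and Hadoop daemons.
    usable_gb = ram_gb - reserved_gb
    mem_per_container_gb = usable_gb / containers
    return containers, mem_per_container_gb

# A hypothetical 16-core, 64 GB, 8-disk node: 64/16 = 4 GB per core,
# which satisfies the 2-4 GB per core guideline.
containers, mem = size_containers(cores=16, ram_gb=64, disks=8)
print(containers, mem)  # 16 containers at 3.5 GB each
```

If the memory per container comes out below roughly 2 GB per core's worth, the node is RAM-starved for its core count and a memory-heavier node type is the better pick.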
Sometimes you can upgrade your compute nodes to a higher level as your needs change, but this does not apply to all available compute nodes. Some premium compute nodes are only available in certain data centers, so you can't upgrade to them if your existing compute nodes are provisioned in a different data center.
Our recommendation for peak performance on Hadoop is a 10-gigabit network to accommodate the high traffic between nodes. However, you can only dream of getting that in cloud IaaS. Most of the time, cloud vendors will guarantee only a network's uptime, not its actual throughput. That's why it's essential to keep network bandwidth in mind when you select compute nodes.
Generally, only certain premium compute nodes support 10-gigabit Ethernet or InfiniBand, and we strongly recommend you select those. You will never actually get the full rated speed on your network, because that network is shared among multiple tenants, but it will still be several notches better than the pedestrian, regular network you will get on the non-premium compute nodes.
In on-premises Hadoop, we seldom worry about disk bandwidth because disks are locally attached via SATA channels and provide a very high throughput. However, in an IaaS cloud, storage is seldom attached locally and hence the network affects how much bandwidth is available for read and write operations to the storage area (also known as disk bandwidth).
You might provision your virtual hard disks as local mount points, but they are hostage to the available disk bandwidth. If you have SLAs that require faster jobs, then keep an eye on this metric while configuring your infrastructure in the cloud. You might be able to get a guaranteed IOPS bandwidth out of the node on par with a physical server, but the cost may be prohibitive.
In an IaaS cloud, all disks are bound by an IOPS limit -- typically no more than 500 per disk. Hadoop, being a high-I/O environment, will almost always blow past that limit when running heavy jobs. The result: throttling of your disks. All of a sudden your jobs will start to crawl instead of run, and you won't know what hit you. The culprit is the exhausted IOPS on the disk.
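A back-of-the-envelope check can tell you in advance whether a job's I/O demand will exceed the aggregate IOPS cap of its disks. The 500 IOPS/disk figure is the typical limit mentioned above; the workload numbers and 64 KB request size are hypothetical:

```python
# Estimate whether a job's I/O demand will hit the per-disk IOPS cap.
# 500 IOPS/disk is the typical limit cited above; the workload figures
# and I/O request size are hypothetical.
IOPS_PER_DISK = 500

def will_throttle(disks, job_mb_per_sec, io_size_kb=64):
    # Convert the job's throughput demand into I/O operations per second.
    demanded_iops = job_mb_per_sec * 1024 / io_size_kb
    available_iops = disks * IOPS_PER_DISK
    return demanded_iops > available_iops

# A node with 8 disks caps out at 4,000 IOPS; a 300 MB/s scan in
# 64 KB requests demands 4,800 IOPS and will be throttled.
print(will_throttle(disks=8, job_mb_per_sec=300))  # True
```

Running this kind of estimate against your heaviest job before provisioning is far cheaper than discovering the throttle in production.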
In most IaaS solutions, disks are generally part of something called storage accounts. Storage accounts have their own limit on total requests made in a time period. If you attach too many disks to the same storage account, you will almost always hit that limit. In Hadoop deployments, a best practice is to have one storage account per node so that all the disks from that node are part of the same storage account.
Another thing to note is that storage accounts are generally charged by the capacity used and not by the capacity provisioned, so you only pay for what you use. Therefore, you should always attach the maximum data disks allowed for the virtual machine size of your node. Also, the data disks should always be the maximum size allowable, which in most cases is one terabyte.
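The account-level request cap works the same way one level up from the per-disk limit. Assuming an illustrative account-wide limit of 20,000 IOPS (the actual quota varies by provider, so check yours), you can compute how many 500-IOPS disks a single storage account can serve at full tilt:

```python
# How many disks can share one storage account before the account-level
# request cap, rather than the disks themselves, becomes the bottleneck?
# The 20,000 figure is an illustrative account-wide limit; check your
# provider's actual quota.
ACCOUNT_IOPS_LIMIT = 20_000
IOPS_PER_DISK = 500

def max_disks_per_account():
    return ACCOUNT_IOPS_LIMIT // IOPS_PER_DISK

print(max_disks_per_account())  # 40 -- beyond that, disks throttle early
```

As long as each node attaches fewer disks than that ceiling, the one-storage-account-per-node best practice keeps every node safely under the cap.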
Recently some cloud IaaS providers started to offer premium storage built on SSDs and sometimes it is locally attached to the provisioned VM. Be extremely careful. Premium storage is generally charged by provisioned capacity and not by actual usage. Premium storage also costs five or more times what regular storage costs; keep that in mind if you are utilizing premium storage for your data nodes.
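The billing model matters even more than the unit price. A rough comparison, with entirely hypothetical per-GB prices, shows how provisioned-capacity billing amplifies the premium when a cluster is provisioned large but only partly filled:

```python
# Compare standard storage (billed per GB actually used) against premium
# storage (billed per GB provisioned). Prices are hypothetical, purely
# for illustration of the billing-model difference.
def monthly_cost(provisioned_tb, used_tb, price_per_gb, billed_on_use):
    billed_tb = used_tb if billed_on_use else provisioned_tb
    return billed_tb * 1024 * price_per_gb

# 10 TB provisioned per node, 3 TB actually used so far.
standard = monthly_cost(10, 3, price_per_gb=0.05, billed_on_use=True)
premium = monthly_cost(10, 3, price_per_gb=0.25, billed_on_use=False)
print(standard, premium)  # 153.6 vs 2560.0 -- roughly 17x, not just 5x
```

With a 5x unit price on top of provisioned-capacity billing, a one-third-full premium data node costs far more than the headline price difference suggests.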
If you ever have to transfer data in or out of your Hadoop cluster in the cloud, it will cost you dearly: the data travels across the Internet and is billed at a per-GB rate. Even if the transfer happens from one cloud location to another (e.g., from the east coast of the United States to the west coast), it will still cost you money. Keep that in mind when provisioning your cluster.
Certain features such as premium storage are only offered at certain locations. If you plan to use that in the future, provision your cluster in the location that offers this feature or else you will be paying data transfer fees to move your entire cluster and data from one location to another. Also, allow yourself plenty of time for those cluster-to-cluster copying or data-migration tasks because when the transfer goes over the Internet it's slow -- really slow.
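A quick estimate makes the point about transfer fees concrete. The per-GB rate below is hypothetical; real rates vary by provider, direction, and destination:

```python
# Estimate the egress bill for moving a Hadoop data set between locations.
# The per-GB rate is hypothetical; check your provider's price sheet.
def transfer_cost_usd(data_tb, rate_per_gb=0.09):
    return data_tb * 1024 * rate_per_gb

print(transfer_cost_usd(50))  # moving a 50 TB cluster: 4608.0
```

At that hypothetical rate, relocating even a mid-sized cluster costs thousands of dollars, which is exactly why picking the right location up front is cheaper than migrating later.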
A Final Word
At this time, on-premises Hadoop is still your best option, but careful planning -- keeping these five considerations at the forefront of your choices -- will help you eke out every ounce of performance possible from your Hadoop cluster in the cloud.