Getting the Most from Hadoop in the Cloud
Keep these five considerations at the forefront of your planning and you'll be able to eke out every ounce of performance possible from your Hadoop cluster in the cloud.
- By Sachin Sinha
- July 7, 2016
Two of the biggest trends in technology right now are cloud computing and Hadoop. With the growing popularity of cloud computing, enterprises are seriously looking at moving any and all workloads to the cloud. At the same time, enterprises have started to evaluate Hadoop -- a technology that makes processing and analysis of modern-day big data workloads possible. As they weigh both, one question comes up frequently: "Can we run Hadoop in the cloud?"
There are three flavors of Hadoop in the cloud.
Hadoop as a service in the public cloud: Amazon's EMR (Elastic MapReduce) provides a quick and easy way to run MapReduce jobs without having to install a Hadoop cluster of your own. This can be a good way to develop Hadoop programming expertise internally within your organization, or a good fit if you only need to run MapReduce jobs for your workloads. MapR also provides an option that can be easily deployed using Amazon's EMR to deliver strong features such as instant recovery and continuous low latency.
Prebuilt Hadoop in the public cloud marketplace: Hadoop distributions (MapR, Cloudera CDH, IBM BigInsights, Hortonworks HDP) can be launched and run on public clouds such as AWS, Rackspace, Microsoft Azure, Google Compute Engine, OpenStack, etc. AWS and MapR together provide a converged platform that can be easily deployed with a choice of several editions. A big advantage of this strategy is that you can quickly stand up not just Hadoop infrastructure but also integrate it with other technologies (such as Amazon Redshift) to build a complete enterprise data hub.
A similar offering on Microsoft Azure allows customers to seamlessly transfer data between MapR and Microsoft SQL services within Azure, and also provides Microsoft Power BI access, via SQL, to data sources in the MapR cluster.
Build your own Hadoop in the public cloud: You can use infrastructure-as-a-service (IaaS) for this. In a public cloud, you are sharing the infrastructure with other customers. As a result, you have very limited control over which server the virtual machine (VM) is spun up on and what other VMs (yours or other customers') are running on the same physical server.
There is no "rack awareness" for you to access and configure in the NameNode. The performance and availability of the cluster may suffer because you are running on VMs. Enterprises can use and pay for these Hadoop clusters on demand.
There are options for creating your own private network using VLAN, but for the best Hadoop cluster performance, we recommend you have a separate, isolated network because of high network traffic between nodes. With this option, you roll up your sleeves and install and configure the Hadoop cluster in the cloud on your own.
Why Hadoop Belongs in the Data Center
Even with all these options, the Hadoop and cloud mega trends may not be ready to be integrated just yet. There are three main reasons why Hadoop belongs in an enterprise data center for now rather than in a cloud computing environment:
1. Heavy and increasing workloads favor on-premises Hadoop. Hadoop clusters tend to be heavily utilized, with capacity being added as resources get scarce rather than being massively over-provisioned. In other words, whether slow and steady or fast and steady, Hadoop clusters get fed data in a mostly predictable fashion, without the peaks and valleys that normally lend themselves to an elastic cloud deployment.
2. Cloud storage is both slower and more expensive for data sets that just keep growing. Cloud storage may have unacceptably long access times, and cost comparisons don't indicate it's inherently cheaper anyway. In addition, Hadoop tends to collect 10 times or more data than legacy transactional environments do; data scientists and their customer-focused business stakeholders will almost never want to discard Hadoop data; and the access requirements are unpredictable -- all of which favors on-premises storage.
3. Data sources and locality make a big difference for performance. Although running Hadoop clusters in the cloud may make sense when the data itself is generated in the cloud (e.g., analysis of Twitter), for real-time, customer-facing systems with data coming from multiple venues, your operations department will likely need to build Hadoop out in a physical facility with the right (deterministic bandwidth and latency) network interconnects to minimize the end-to-end latency of the application.
Getting the Greatest Performance
If, however, you do decide to play with fire and want to install Hadoop in a cloud IaaS, then take note of these best practices.
It's important to carefully pick compute capacity. Not all compute nodes are equal: providers offer a buffet of options, some heavy on processors while others give you more RAM. Because Hadoop is a compute-intensive platform and its clusters tend to be heavily utilized, pick compute nodes with beefier processors and higher amounts of RAM.
As a general rule of thumb, allowing 1-2 containers per disk and per core gives the best balance for cluster utilization. The same goes for the vcore-to-RAM ratio: under YARN resource management, a poor ratio heavily penalizes system utilization. Another general rule of thumb: allocate 2-4 GB of RAM per core at a minimum.
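The rules of thumb above reduce to simple arithmetic. Here's a minimal sketch; the node specs and the 8 GB OS/daemon reservation are illustrative assumptions, not recommendations:

```python
# Rough YARN container sizing from the rules of thumb above.
# Node specs and the reserved_gb figure are purely illustrative.
def size_containers(cores, ram_gb, disks, reserved_gb=8):
    # 1-2 containers per disk and per core; take the smaller bound
    # so neither disks nor cores are oversubscribed.
    containers = min(2 * cores, 2 * disks)
    # Leave some RAM for the OS and Hadoop daemons.
    usable_gb = ram_gb - reserved_gb
    mem_per_container_gb = usable_gb / containers
    return containers, mem_per_container_gb

# A hypothetical 16-core, 64 GB, 8-disk node: 64/16 = 4 GB per core,
# which satisfies the 2-4 GB per core guideline.
containers, mem = size_containers(cores=16, ram_gb=64, disks=8)
print(containers, mem)  # 16 containers at 3.5 GB each
```

If the memory per container comes out below roughly 2 GB per core's worth, the node is RAM-starved for its core count and a memory-heavier node type is the better pick.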
Sometimes you can upgrade your compute nodes to a higher level as your needs change, but this does not apply to all available compute nodes. Some premium compute nodes are only available in certain data centers, so you can't upgrade to them if your existing compute nodes are provisioned in a different data center.
Our recommendation for peak performance on Hadoop is a 10-gigabit network to accommodate the high traffic between nodes. However, you can only dream of getting that in cloud IaaS. Most of the time, cloud vendors will guarantee only a network's uptime, not its actual throughput. That's why it's essential to keep network bandwidth in mind when you select compute nodes.
Generally, only certain premium compute nodes support 10-gigabit Ethernet or InfiniBand, and we strongly recommend you select those. You will never actually get the full rated speed on your network, because that network is shared among multiple tenants, but it will still be several notches better than the pedestrian, regular network you will get on the non-premium compute nodes.
In on-premises Hadoop, we seldom worry about disk bandwidth because disks are locally attached via SATA channels and provide a very high throughput. However, in an IaaS cloud, storage is seldom attached locally and hence the network affects how much bandwidth is available for read and write operations to the storage area (also known as disk bandwidth).
You might provision your virtual hard disks as local mount points, but they are hostage to the available disk bandwidth. If you have SLAs that require faster jobs, then keep an eye on this metric while configuring your infrastructure in the cloud. You might be able to get a guaranteed IOPS bandwidth out of the node on par with a physical server, but the cost may be prohibitive.
In an IaaS cloud, all disks are bound by an IOPS limit -- typically no more than 500 per disk. Hadoop, being a high-I/O environment, will almost always blow past that limit when running heavy jobs. The result: throttling of your disks. All of a sudden your jobs will start to crawl instead of run, and you won't know what hit you. The culprit is the exhausted IOPS on the disk.
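A back-of-the-envelope check can tell you in advance whether a job's I/O demand will exceed the aggregate IOPS cap of its disks. The 500 IOPS/disk figure is the typical limit mentioned above; the workload numbers and 64 KB request size are hypothetical:

```python
# Estimate whether a job's I/O demand will hit the per-disk IOPS cap.
# 500 IOPS/disk is the typical limit cited above; the workload figures
# and I/O request size are hypothetical.
IOPS_PER_DISK = 500

def will_throttle(disks, job_mb_per_sec, io_size_kb=64):
    # Convert the job's throughput demand into I/O operations per second.
    demanded_iops = job_mb_per_sec * 1024 / io_size_kb
    available_iops = disks * IOPS_PER_DISK
    return demanded_iops > available_iops

# A node with 8 disks caps out at 4,000 IOPS; a 300 MB/s scan in
# 64 KB requests demands 4,800 IOPS and will be throttled.
print(will_throttle(disks=8, job_mb_per_sec=300))  # True
```

Running this kind of estimate against your heaviest job before provisioning is far cheaper than discovering the throttle in production.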
In most IaaS solutions, disks are generally part of something called storage accounts. Storage accounts have their own limit on total requests made in a time period. If you attach too many disks to the same storage account, you will almost always hit that limit. In Hadoop deployments, a best practice is to have one storage account per node so that all the disks from that node are part of the same storage account.
Another thing to note is that storage accounts are generally charged by the capacity used and not by the capacity provisioned, so you only pay for what you use. Therefore, you should always attach the maximum data disks allowed for the virtual machine size of your node. Also, the data disks should always be the maximum size allowable, which in most cases is one terabyte.
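The account-level request cap works the same way one level up from the per-disk limit. Assuming an illustrative account-wide limit of 20,000 IOPS (the actual quota varies by provider, so check yours), you can compute how many 500-IOPS disks a single storage account can serve at full tilt:

```python
# How many disks can share one storage account before the account-level
# request cap, rather than the disks themselves, becomes the bottleneck?
# The 20,000 figure is an illustrative account-wide limit; check your
# provider's actual quota.
ACCOUNT_IOPS_LIMIT = 20_000
IOPS_PER_DISK = 500

def max_disks_per_account():
    return ACCOUNT_IOPS_LIMIT // IOPS_PER_DISK

print(max_disks_per_account())  # 40 -- beyond that, disks throttle early
```

As long as each node attaches fewer disks than that ceiling, the one-storage-account-per-node best practice keeps every node safely under the cap.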
Recently some cloud IaaS providers started to offer premium storage built on SSDs and sometimes it is locally attached to the provisioned VM. Be extremely careful. Premium storage is generally charged by provisioned capacity and not by actual usage. Premium storage also costs five or more times what regular storage costs; keep that in mind if you are utilizing premium storage for your data nodes.
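The billing model matters even more than the unit price. A rough comparison, with entirely hypothetical per-GB prices, shows how provisioned-capacity billing amplifies the premium when a cluster is provisioned large but only partly filled:

```python
# Compare standard storage (billed per GB actually used) against premium
# storage (billed per GB provisioned). Prices are hypothetical, purely
# for illustration of the billing-model difference.
def monthly_cost(provisioned_tb, used_tb, price_per_gb, billed_on_use):
    billed_tb = used_tb if billed_on_use else provisioned_tb
    return billed_tb * 1024 * price_per_gb

# 10 TB provisioned per node, 3 TB actually used so far.
standard = monthly_cost(10, 3, price_per_gb=0.05, billed_on_use=True)
premium = monthly_cost(10, 3, price_per_gb=0.25, billed_on_use=False)
print(standard, premium)  # 153.6 vs 2560.0 -- roughly 17x, not just 5x
```

With a 5x unit price on top of provisioned-capacity billing, a one-third-full premium data node costs far more than the headline price difference suggests.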
If you ever have to transfer data in or out of your Hadoop cluster in the cloud, it will cost you dearly: the data travels across the Internet and is billed at a per-GB rate. Even if the transfer happens from one cloud location to another (e.g., from the east coast of the United States to the west coast), it will still cost you money. Keep that in mind when provisioning your cluster.
Certain features such as premium storage are only offered at certain locations. If you plan to use that in the future, provision your cluster in the location that offers this feature or else you will be paying data transfer fees to move your entire cluster and data from one location to another. Also, allow yourself plenty of time for those cluster-to-cluster copying or data-migration tasks because when the transfer goes over the Internet it's slow -- really slow.
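A quick estimate makes the point about transfer fees concrete. The per-GB rate below is hypothetical; real rates vary by provider, direction, and destination:

```python
# Estimate the egress bill for moving a Hadoop data set between locations.
# The per-GB rate is hypothetical; check your provider's price sheet.
def transfer_cost_usd(data_tb, rate_per_gb=0.09):
    return data_tb * 1024 * rate_per_gb

print(transfer_cost_usd(50))  # moving a 50 TB cluster: 4608.0
```

At that hypothetical rate, relocating even a mid-sized cluster costs thousands of dollars, which is exactly why picking the right location up front is cheaper than migrating later.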
A Final Word
At this time, on-premises Hadoop is still your best option, but careful planning -- keeping these five considerations at the forefront of your choices -- will help you eke out every ounce of performance possible from your Hadoop cluster in the cloud.