The Future of Big Data is Cloudy -- and Bumpy
Although 72 percent of companies are planning to deploy big data in the cloud, the migration path is not always smooth. These tips will help you avoid some common potholes.
- By Dave Mariani
- May 30, 2017
In my role as CEO of a big data analytics company, I spend a lot of time with CIOs and chief data officers. I'm exposed early to some of the challenges these executives are wrestling with. In every single one of my meetings, one topic dominates the conversation yet is seldom covered in the press: the implication of the cloud for big data and Hadoop.
This is odd because, according to a recent survey, 72 percent of companies are planning to deploy big data in the cloud. With or without media attention, this is about to become a critical topic for most enterprises.
Cloud Is Not a Location
If you think that "big data in the cloud" is the same as "big data on premises" except on someone else's servers, think again.
Although big data as a service makes server provisioning and maintenance drastically easier, it doesn't obviate the need for ongoing management of the underlying software platform. In addition, deploying big data in the cloud can introduce a whole new set of challenges:
- Cost: The cost advantages of a public cloud may seem compelling at first. However, the "always on" nature of big data platforms can make hosting expensive and force enterprises to re-architect solutions to support demand elasticity.
- Lock-in: There's a danger of cloud vendor lock-in if you rely on proprietary tools or storage systems provided by a given cloud vendor.
- Data movement: Moving large amounts of data from on-premises servers to the cloud is not always feasible.
Let's look at each of these issues in greater detail.
Cost: "Always on" Means the Meter Is Always Running
Before starting AtScale, I ran the engineering group at Klout. My team managed a 200-node Hadoop cluster with more than 1.4 petabytes of data storage that ingested over 12 billion social events per day.
When we evaluated building our clusters on premises rather than in the cloud, we found that the time to reach ROI for buying physical servers and hosting them in our own data center was only eight months longer than for deploying in the cloud. Why? As it turned out, the cost of 200 cloud servers running 24 hours a day adds up quickly.
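To make the "meter is always running" point concrete, here is a minimal Python sketch of the arithmetic. The 200-node cluster size comes from the Klout example; the hourly rate is a made-up placeholder, not a real price from any provider, so substitute your own figure.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def annual_cluster_cost(nodes: int, hourly_rate: float,
                        hours_per_day: float = HOURS_PER_DAY) -> float:
    """Yearly bill for a cluster that is charged per node-hour."""
    return nodes * hourly_rate * hours_per_day * DAYS_PER_YEAR


# Hypothetical per-node price in dollars per hour (an assumption for illustration).
ASSUMED_HOURLY_RATE = 0.50

always_on = annual_cluster_cost(200, ASSUMED_HOURLY_RATE)
business_hours = annual_cluster_cost(200, ASSUMED_HOURLY_RATE, hours_per_day=8)

print(f"Always on:      ${always_on:,.0f} per year")
print(f"8 hours a day:  ${business_hours:,.0f} per year")
```

At this assumed rate, running around the clock costs three times as much as running eight hours a day, which is why the ability to shut nodes down matters so much to the comparison.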
We also asked the obvious question: why not deploy to the cloud, but turn off the servers when we don't need them? The answer, we discovered, lies in the Hadoop Distributed File System (HDFS) and the "shared nothing" architecture of Hadoop. Essentially, because the data is stored locally with each physical server, shutting servers down means losing all your data!
Thus, the only way to achieve elasticity for Hadoop in the cloud is to separate storage and computing resources by using an object storage system such as S3. This can result in a sizeable performance penalty because I/O is no longer distributed across the local disks of the cluster, which is a huge part of the Hadoop value proposition.
Lock-in: Don't Limit Your Flexibility
It's tempting to delve into the many new tools and technologies that appear in the cloud vendor marketplace almost every week. Be careful! Investing in a cloud vendor's proprietary technologies makes you beholden to that provider. Given the fast pace of innovation and pricing volatility, you may find yourself locked into an expensive or inferior cloud platform; it may take years to undo your investment.
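One common way to limit this exposure is to route every call to a vendor service through a small interface that you own. The Python sketch below illustrates the idea under stated assumptions: `BlobStore`, `LocalBlobStore`, and `archive_events` are hypothetical names invented for this example, and the local-disk implementation stands in for whatever cloud storage client you would actually wrap.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class BlobStore(ABC):
    """The only storage interface the rest of the application is allowed to see."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalBlobStore(BlobStore):
    """Stand-in backend that writes to local disk.

    A vendor-specific backend would implement the same two methods,
    keeping all proprietary calls inside one class.
    """

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def archive_events(store: BlobStore, day: str, payload: bytes) -> str:
    """Application code depends on BlobStore, never on a vendor SDK."""
    key = f"events/{day}.json"
    store.put(key, payload)
    return key


with tempfile.TemporaryDirectory() as tmp:
    store = LocalBlobStore(Path(tmp))
    key = archive_events(store, "2017-05-30", b'{"clicks": 12}')
    print(key, store.get(key))
```

Switching providers then means writing one new `BlobStore` subclass rather than touching every job that reads or writes data.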
Data Movement: Data Is Hard and Expensive to Move
No matter how much bandwidth you have, it is often not practical to move large quantities of data over the wire. When you examine use cases for cloud migration, start with the applications that already generate data in the cloud. This will drastically simplify data ingestion scenarios and open up the world of low-latency analytics to your end users. Save migration of on-premises data applications until after you've worked out the kinks of these new cloud platforms.
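A quick back-of-the-envelope calculation shows why. The Python sketch below estimates how long a bulk copy takes at a given link speed. The 1.4 petabyte figure is the Klout cluster size mentioned earlier; the link speeds and the assumption of a fully dedicated, fully utilized link are illustrative only, and real transfers will be slower.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15   # bytes, decimal definition
GIGABIT = 10**9     # bits


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  utilization: float = 1.0) -> float:
    """Days needed to push data_bytes through a link at the given utilization."""
    seconds = (data_bytes * 8) / (link_bits_per_second * utilization)
    return seconds / SECONDS_PER_DAY


data = 1.4 * PETABYTE

# Assumed link speeds; pick the one closest to your own network.
for gbps in (1, 10):
    days = transfer_days(data, gbps * GIGABIT)
    print(f"{gbps:>2} Gbps link: {days:.1f} days")
```

Even under these generous assumptions, a 1 Gbps link needs roughly 130 days to move 1.4 petabytes, and a 10 Gbps link still needs about 13.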
Additional Tips for Avoiding Pitfalls
Here are some further tips for smoothing your big data infrastructure migration to the cloud. They are listed in no particular order.
- Avoid architecting ingest, storage, and query environments on cloud-vendor-specific technology. Instead, whenever possible, rely on open source technologies and open, standard, and portable interfaces. It's a red flag if you can't simply forklift your big data environment to another vendor's cloud. If you must rely on vendor-specific technology, centralize those interfaces and wall them off from the rest of your environment so you can limit your migration costs later.
- Invest in demand-based, auto-scaling technology for your big data infrastructure. Always-on big data clusters can break the bank. Instead, try separating storage and computing resources for your Hadoop clusters, or try leveraging query-as-a-service technologies that support standard SQL, such as Google BigQuery or Amazon Athena. If you are deploying Hadoop in the cloud, make sure you can scale the number of data nodes based on daily or weekly demand.
- For cloud migration, prioritize projects that rely on data already generated in the cloud. If it's impossible to generate and store data in the cloud, invest in trickle-load technologies to move only new or changed data to the cloud.
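The trickle-load idea in the last tip can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: `Record`, `select_changed`, and `trickle_load` are hypothetical names, and the sketch assumes each source record carries a reliable last-modified timestamp to use as a watermark.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Record:
    key: str
    updated_at: datetime
    payload: str


def select_changed(records: Iterable[Record], watermark: datetime) -> List[Record]:
    """Return records modified after the watermark, oldest first."""
    changed = [r for r in records if r.updated_at > watermark]
    return sorted(changed, key=lambda r: r.updated_at)


def trickle_load(records: Iterable[Record], watermark: datetime,
                 upload: Callable[[Record], None]) -> datetime:
    """Upload only new or changed records and return the advanced watermark."""
    for record in select_changed(records, watermark):
        upload(record)
        # Advance only after a successful upload so a failure can be retried.
        watermark = record.updated_at
    return watermark


rows = [
    Record("a", datetime(2017, 5, 1), "unchanged"),
    Record("b", datetime(2017, 5, 3), "edited"),
    Record("c", datetime(2017, 5, 2), "new"),
]
sent: List[Record] = []
new_watermark = trickle_load(rows, datetime(2017, 5, 1), sent.append)
print([r.key for r in sent], new_watermark)
```

Persisting the returned watermark between runs means each run sends only what changed since the previous one, rather than recopying the full data set.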
The Kicker: Hybrid Rules
Once you've avoided these pitfalls, you'll likely still have to align your cloud efforts with the reality confronting most enterprises today: hybrid will rule. Here's a final tip: because migrations can take years (or even decades), make sure to choose software vendors that are architected for hybrid deployment scenarios.
The future is definitely cloud-y. Follow these tips and make sure your house can weather the storm!