In Praise of Elasticity
Get ready for the elastic data warehouse -- which is what, exactly?
- By Steve Swoyer
- April 21, 2016
There's a new EDW in town. No, it isn't the enterprise data warehouse. That's so last millennium. Nor is it the extended data warehouse, an attempt to bridge the SQL and NoSQL worlds.
It's the elastic data warehouse, which is -- what exactly?
In a basic sense, the elastic data warehouse is a data warehouse in the cloud, or DW-as-a-service.
That isn't quite it, however, argues Kent Graziano, senior technical evangelist with cloud data warehousing specialist Snowflake Computing Inc.
According to Graziano, there's a world of difference between a conventional, on-premises database and a database that's been designed with the benefits of the cloud -- especially elasticity -- in mind.
"It requires a different approach to loading the data, querying the data, and moving the data around for data warehouse workloads. Because of the nature of it, you need to have the ability to do different kinds of things without physically touching the hardware, so there's a service layer in there that allows you to manage things remotely via the cloud," he explains.
"You think of Salesforce.com as a cloud-based CRM system, so something akin to that, but optimized purely for dealing with the data. It allows you to access, load, and scale your data, scale your workloads, in the cloud, without a lot of messing around with the hardware."
Graziano says Snowflake was designed from scratch for the multi-tenant cloud. It's a massively parallel processing (MPP) database, which means it distributes data across multiple clustered nodes. This makes it a powerful query-processing platform, one that's notionally comparable to MPP DW-as-a-service offerings from Amazon.com Inc. (Redshift), Microsoft Corp. (Azure SQL Data Warehouse), and Teradata Corp. (Teradata Cloud). For Graziano, however, these and other MPP cloud data warehouse services are insufficiently elastic.
"Elasticity" on Snowflake's terms means a system that's designed not just with the advantages but with the constraints of the cloud model in mind. The foremost of these -- multi-tenancy -- is the type of thing that cuts both ways. On the one hand, multi-tenancy makes it possible for multiple workloads to coexist on the same physical hardware, simultaneously sharing access to virtualized compute, storage, and network resources.
Multi-tenancy is necessary in order for elasticity to be possible. Think of elasticity as that property -- unique to the cloud -- that permits a subscriber to scale up or scale down compute or storage capacity as needed. Business conditions change? Scale up or scale down on demand. Need to improve query responsiveness for certain groups or users? Scale up by adding more nodes. The beauty of MPP is that it can be predictably scaled: add four nodes to a four-node MPP cluster and you'll roughly double its performance. The beauty of MPP in the multi-tenant cloud is that you can add extra compute capacity at negligible cost. It's also much cheaper to turn off capacity when you no longer need it.
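The "add four nodes to a four-node cluster" arithmetic can be made concrete with a back-of-the-envelope sketch. The function name and the `efficiency` parameter below are illustrative assumptions, not anything a vendor publishes; real clusters rarely scale perfectly linearly, which is why the article says "roughly."

```python
def estimated_speedup(nodes_before: int, nodes_after: int, efficiency: float = 1.0) -> float:
    """Estimate relative throughput after resizing an MPP cluster.

    `efficiency` is a hypothetical fudge factor for the *added* nodes:
    1.0 means perfectly linear scaling, lower values model overhead
    such as data redistribution or coordination cost.
    """
    if nodes_before <= 0 or nodes_after <= 0:
        raise ValueError("node counts must be positive")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0.0 and 1.0")

    added = nodes_after - nodes_before
    if added >= 0:
        # Scaling up: each added node contributes `efficiency` of a full node.
        effective_nodes = nodes_before + added * efficiency
    else:
        # Scaling down: capacity simply shrinks to the remaining nodes.
        effective_nodes = nodes_after
    return effective_nodes / nodes_before


if __name__ == "__main__":
    print(estimated_speedup(4, 8))        # ideal case: four nodes added to four
    print(estimated_speedup(4, 8, 0.9))   # same resize with some scaling overhead
    print(estimated_speedup(8, 4))        # scaling back down when demand drops
```

Under the ideal assumption, going from four nodes to eight yields a factor of 2.0, which matches the "roughly double" claim; any efficiency below 1.0 pulls the figure under 2.0.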
That's the good. The bad is that classic multi-tenancy can be hostile to decision support workloads. The reasons for this are complicated, having largely to do with the characteristics of analytic workloads, which are often computationally intensive and involve lots of disk writes. In a multi-tenant context in which resources are virtualized, and compute and storage resources aren't "local" in the sense of an on-premises MPP configuration, the performance and responsiveness -- the availability -- of a data warehouse system could be impacted, right?
Yes and no, Graziano parries. If you're talking about a database system that wasn't originally designed for multi-tenancy, yes, he says, performance is likely to be impacted. Snowflake, he argues, uses several techniques to mitigate potential issues. For example, for primary storage, Snowflake uses Amazon's Simple Storage Service (S3) as a persistence layer. However, it also uses an SSD layer for caching data, as well as for temp space. The faster SSD layer helps to offset S3's latency.
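The general idea of putting a fast tier in front of a slower persistence layer can be sketched as a read-through cache. This is a minimal illustration of the pattern only, not Snowflake's implementation: `SlowObjectStore`, `ReadThroughCache`, the LRU eviction policy, and the capacity setting are all assumptions made for the example, and the store is an in-memory stand-in rather than the S3 API.

```python
import time
from collections import OrderedDict


class SlowObjectStore:
    """Hypothetical stand-in for a high-latency persistence layer."""

    def __init__(self, objects, latency_s=0.0):
        self._objects = dict(objects)
        self.latency_s = latency_s
        self.reads = 0  # counts how often the slow tier is actually hit

    def get(self, key):
        time.sleep(self.latency_s)  # simulate the slower round trip
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Small LRU cache playing the role of the faster tier."""

    def __init__(self, store, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.store = store
        self.capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            # Cache hit: mark as most recently used and skip the slow tier.
            self._cache.move_to_end(key)
            self.hits += 1
            return self._cache[key]
        # Cache miss: fetch from the slow tier, then keep a local copy.
        self.misses += 1
        value = self.store.get(key)
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value


if __name__ == "__main__":
    store = SlowObjectStore({"part-1": b"rows", "part-2": b"more rows"}, latency_s=0.05)
    cache = ReadThroughCache(store, capacity=2)
    cache.get("part-1")  # first read pays the slow-tier latency
    cache.get("part-1")  # repeat read is served from the cache
    print(store.reads, cache.hits, cache.misses)
```

Repeated reads of the same data touch the slow tier only once, which is the sense in which a faster layer can offset the latency of the layer behind it. How much it helps in practice depends on how often queries revisit the same data.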
What the Cloud Data Warehouse Changes
Will the performance of a multi-tenant MPP cloud data warehouse -- even an "elastic" data warehouse, such as Snowflake -- be roughly comparable to that of an on-premises MPP data warehouse system? Graziano says it will, but he's hardly a disinterested observer. More likely, performance and other availability characteristics will be impacted by the vicissitudes of the cloud model. In moving data warehouse workloads to the cloud, you're going to sacrifice control over some of the features (such as performance and availability) that you were able to tweak in an on-premises environment. (It's telling that most cloud data warehouse providers do not offer granular, performance-based service level agreements. Teradata's cloud offering is an exception.)
On the other hand, the cloud offers a slew of advantages vis-a-vis physical, on-premises implementations. There's elasticity, for starters, which radically changes how you plan for, budget for, procure, and maintain a data warehouse system. "In the traditional data warehousing world, whether it's your traditional on-premises databases or even your pre-packaged data warehouse appliances, there's actual hardware constraints on how many nodes you can buy, or how much disk you can buy, and you have to do all of that up front. With elastic data warehousing, you don't need to do that. You don't need to preallocate or prepurchase a certain amount of disk or a certain amount of compute power. It makes things a lot easier," Graziano argues.
In addition, the DW-as-a-service model takes over data warehouse maintenance. DW-as-a-service a la Snowflake, Amazon, Microsoft, and Teradata aims to eliminate most of the tedious upkeep associated with the conventional, on-premises data warehouse model. In a sense, it eliminates the problem of data warehouse obsolescence, too: subscribers don't have to upgrade or replace system hardware. That is done in the background by the service provider.
"You don't have to be an expert administrator, a systems administrator, a database administrator, to deal with this and make it work," Graziano concludes. "What are we going to do when we need to go from 100 to 1,000 users? We're going to spin up more compute clusters, that's what we're going to do. On the infrastructure side, you don't have those planning and budgeting problems anymore."
Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at firstname.lastname@example.org.