Snowflake Computing: A New Take on a Data Warehouse in the Cloud
Although business analytics is highly personal, the components of an analytic practice -- starting with the data warehouse itself -- need not be.
- By Stephen Swoyer
- February 17, 2015
The problem with shifting analytic workloads to the cloud, or with spinning up completely new analytic practices in the cloud, is that you're doing it in the cloud. Analytics is a highly personal, quasi-bespoke business: no one company's analytic practice is going to look exactly like any other company's.
How does this jibe with the no-assembly-required model of the cloud? It might depend on what one means by "highly personal." Upstart cloud analytics player Snowflake Computing Inc. makes no secret of this. Analytics is highly personal, intensely personal, even, and building out an analytic practice really does require a kind of quasi-bespoke fitting or tailoring. This doesn't mean that the components of such a practice -- starting with the data warehouse itself -- can't be implemented in a cloud context, however.
This is what Snowflake purports to do with its data warehouse platform-as-a-service (PaaS) solution, which it says automates the sizing, provisioning, and scaling of a data warehouse -- along with its ongoing tuning, maintenance, and resizing. "One thing that's enabled by the cloud that we haven't seen anybody take advantage of so far is that you can deliver a full service that goes beyond eliminating the hardware and software install, which can automatically scale up and scale down [as needed]," says Jon Bock, vice president of product and marketing with Snowflake.
Core to this capacity to "scale up and scale down" is a concept that Bock dubs "multidimensional elasticity." This, he says, is perhaps the most distinctive advantage of the cloud model.
"What is the property you most associate with cloud? One of the things that always comes up is elasticity. Traditional software isn't designed for [elasticity], you couldn't scale up and scale down if you needed to. People had [come to] the point where they were adding new systems [to accommodate demand] and siloing data into all of these different types of systems," he explains.
An oft-touted advantage of both massively parallel processing (MPP) databases and the Hadoop environment is that both paradigms combine scalable storage with scalable parallel processing. Bock turns this advantage on its head, however, claiming that the Snowflake model actually separates computing from storage -- or, more precisely, eliminates the need to size and balance a system prior to deploying it. This is by no means a trivial problem. Any computing system, for example, is only as scalable as its capacity to ingest, store, and process data. (This is a well-known problem with MPP and no less of a problem with Hadoop.)
"In MPP, [computing resources] and storage are sitting next to each other. That has a lot of implications. If I want to store a huge amount of data and do a small amount of processing, I have to build a huge data warehouse. If I want to do a huge amount of processing and store a small amount of data, I have to build a huge data warehouse," Bock says.
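Bock's sizing argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the node capacities, the `NodeSpec` type, and both helper functions are hypothetical, and they describe neither Snowflake's internals nor any particular MPP product. It simply compares how many units must be provisioned when storage and compute are bundled per node with how many are needed when each tier is sized on its own.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Capacity of one hypothetical node (made-up numbers for illustration)."""
    storage_tb: float
    compute_units: float


def coupled_nodes(data_tb: float, compute_needed: float, node: NodeSpec) -> int:
    """Nodes required when every node bundles storage and compute together.

    The larger of the two requirements dictates the size of the whole system.
    """
    for_storage = math.ceil(data_tb / node.storage_tb)
    for_compute = math.ceil(compute_needed / node.compute_units)
    return max(for_storage, for_compute)


def decoupled_units(data_tb: float, compute_needed: float, node: NodeSpec) -> tuple:
    """(storage units, compute units) when each tier is sized independently."""
    return (
        math.ceil(data_tb / node.storage_tb),
        math.ceil(compute_needed / node.compute_units),
    )


node = NodeSpec(storage_tb=2.0, compute_units=10.0)

# Lots of data, very little processing.
print(coupled_nodes(100, 10, node))    # 50
print(decoupled_units(100, 10, node))  # (50, 1)

# Very little data, lots of processing.
print(coupled_nodes(2, 500, node))     # 50
print(decoupled_units(2, 500, node))   # (1, 50)
```

In both scenarios the coupled design ends up with 50 full nodes, which is the "huge data warehouse" Bock describes; the decoupled design grows only the tier that is under pressure.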
To that end, Snowflake's PaaS data warehouse solution automates the background work. This includes, crucially, tuning, sizing, archiving, and managing of backups, among other tasks.
"There's a lot of care and feeding you have to do with a data warehouse today even if you take away the hardware infrastructure and the data warehouse software. Things like tuning and sizing, managing backups, and all of this is true even in the case of what are called cloud data warehouses today," he comments. "Fundamentally, we saw the ability ... to take a lot of that off of customers' plates, so everything from optimizing performance to managing how data is stored in the system to managing security, that's all provided in the service," he explains.
Most of what Snowflake does falls under the rubric of "special sauce." Bock, for example, claims that the Snowflake database was designed from scratch as a data warehouse system for the cloud. This is a critical consideration, he argues, because other new-ish distributed or cloud database platforms (such as NuoDB) are designed as OLTP-first platforms. Unlike these systems, Snowflake was designed and optimized for analytics workloads, says Bock.
Like most distributed databases, however, Snowflake implements an eventual consistency model, via multi-version concurrency control. (Increasingly, the term "basically available, soft state, eventual consistency" -- or BASE -- is used as a punning neologism for this model.) Most (R)DBMS platforms, be they designed for OLTP or analytic use, adhere instead to traditional ACID requirements. In any case, Bock claims, Snowflake's design gives it a competitive leg up versus other distributed database systems and versus traditional (R)DBMS platforms, which weren't originally designed to run and scale in a cloud context.
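For readers unfamiliar with the term, multi-version concurrency control generally means that a write adds a new version of a value rather than overwriting the old one, so a reader working from an earlier snapshot keeps seeing the data as it stood at that point. The toy store below is a minimal sketch of that general idea only. The `VersionedStore` class and its methods are hypothetical and are not drawn from Snowflake or from any real database engine.

```python
class VersionedStore:
    """Toy multi-version key-value store (illustrative, single-threaded)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit counter, not wall-clock time

    def write(self, key, value):
        """Append a new version instead of overwriting; return its commit ts."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a timestamp that pins the reader to the current state."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("region", "EMEA")
snap = store.snapshot()          # reader pins its view here
store.write("region", "APAC")    # later write creates a second version

print(store.read("region", snap))              # EMEA
print(store.read("region", store.snapshot()))  # APAC
```

The reader holding the older snapshot is never blocked by the later write, and it never sees a half-applied change; it simply sees an older version.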
"In cloud data warehouses, scaling up is not a trivial operation: it can be a one button operation [from a user's perspective], but that one button kicks off a background process that could take several hours," he explains. "Shrinking a data warehouse system in the physical world has not really been an option. We talked to some customers [of cloud data warehouse services] ... and they told us about having to wait six to eight hours to resize their data warehouse. We deliver a design that effectively in seconds to maybe a few minutes can scale up and scale down that data warehouse."
What about Amazon? In Redshift, its PaaS MPP data warehouse offering, Amazon has a runaway success. (A well-placed source tells BI This Week that Redshift is one of the fastest growing services in Amazon's history.) With Redshift, Amazon purports to simplify -- if not to automate -- some of the same tasks Snowflake says it's automating. More to the point, Redshift is based on the former ParAccel database (now Actian "Matrix"), which is an ACID-compliant MPP engine; Snowflake, on the other hand, is based on a non-MPP database design.
Notionally, at least, MPP lends itself more readily to the cloud than does a traditional, non-MPP database architecture. For example, one scales an MPP data warehouse by installing and configuring additional physical nodes, and the resulting performance gain is bankable: if you add four extra nodes to an existing four-node MPP configuration, you can reasonably expect to double the performance of that system. In theory, it's possible to predictably scale an MPP data warehouse in an on-premises context. In theory, it should also be possible to do so in the cloud.
Bock demurs, contending that even though Redshift might simplify the design and deployment of a data warehouse, it doesn't substantively automate these tasks -- not to the degree that Snowflake does, at least. The MPP-cloud advantage is moot, too, according to Bock: Snowflake was conceived as an analytic platform for the cloud. It's at least as elastically scalable as (if not more so than, he claims) a conventional MPP (R)DBMS that's transplanted into the cloud. (As for MPP's performance advantage, Amazon Redshift doesn't offer query-performance or concurrency service-level agreements, nor does Snowflake.)
Finally, Bock says, even though Snowflake uses Amazon's cloud services infrastructure, it isn't tightly coupled to it: "We started with Amazon, because that is where people have the most data and the most processing going on right now. [However,] we made great efforts to make sure that we have the flexibility to support other clouds in the future."
Some Assembly (Will Always Be?) Required
Architecturally, cloud PaaS offerings tend to closely resemble on-premises analytic implementations: PaaS services from both Birst Inc. and Good Data Inc. implement something like data warehouse architecture-in-the-cloud even if they don't expose an explicit "data warehouse" system. The data warehouse -- as a consistent, time-variant, controlled repository; as an abstraction layer; and as a representation of a business's "world," complete with a data model and business views -- is nevertheless implicit in their architectures. (Consider this excerpt from a support document on Good Data's website: "Effective data modeling requires a distinct set of skills that may not be part of a general software engineering background. If you are unsure if you or your team has the appropriate skills, please contact GoodData Customer Support.")
Snowflake's service is no different. In the vast majority of cases, a business analyst or a self-starting business user couldn't spin up a Snowflake data warehouse-as-a-service without also involving data management. Like the PaaS analytic services of Birst and Good Data, then, Snowflake's service requires specialized expertise. For example, data modeling is an inescapable requirement in the Snowflake model, just as it is with Good Data and, to a lesser extent, with Birst. (Birst claims to be able to automatically translate a logical data model into a denormalized star schema. This is something that data warehouse automation tools also do. You wouldn't attempt to use any of these tools without assistance of some kind from data management.)
This has to do with the richness of SQL, which is an expressive and highly productive language for working with strictly structured data from (mostly) OLTP sources. If you're going to work with SQL, you need a data model. Even though Snowflake supports relaxed schema requirements (like many of its cloud and on-premises database kith, it can ingest and persist JSON and similar poly-structured data types), SQL is still the analytic lingua franca, at least for the kinds of analytics that are core to day-to-day decision-making.
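The gap between poly-structured JSON and a SQL-speaking tool can be shown in a few lines. The sketch below is an assumption-laden illustration: it uses Python's built-in SQLite module as a stand-in for any SQL engine, and the sample records, the `events` table, and the `flatten` helper are all invented for the example. It does not show Snowflake's own JSON handling or SQL syntax. The point is simply that somebody has to decide which JSON fields become which columns before a plain SQL query can run.

```python
import json
import sqlite3

# Two JSON records with different shapes, as poly-structured data often has.
raw_events = [
    '{"user": "ann", "action": "login", "device": {"os": "ios"}}',
    '{"user": "bob", "action": "purchase", "amount": 19.5}',
]


def flatten(record: str) -> tuple:
    """Project one JSON document onto a fixed set of columns.

    Choosing these columns is the data modeling step; missing fields
    become NULL (None) in the resulting row.
    """
    doc = json.loads(record)
    return (
        doc.get("user"),
        doc.get("action"),
        doc.get("amount"),
        doc.get("device", {}).get("os"),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_name TEXT, action TEXT, amount REAL, device_os TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [flatten(r) for r in raw_events],
)

# Once the data has a schema, any SQL-speaking tool can query it.
rows = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
print(rows)  # [('login', 1), ('purchase', 1)]
```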
What's more, most customers are still working with BI tools -- including Tableau and Qlik -- that expect to speak a SQL flavor of some kind. "The data modeling part is still something that a customer will need to do. It requires them to bring their intelligence about what they're doing," Bock says.
"SQL is still the core language we're using, so they will need to build a data model, although we give them the flexibility with JSON and other [poly-structured] data of not having to define a data model," he notes. "This is a necessary limitation of BI [architecture]: a BI tool doesn't understand that Snowflake has this special way of accessing data without a schema."