The Data Warehouse Improved and Reimagined
Two companies take a different approach to building the better-than-data warehouse data warehouse.
- By Steve Swoyer
- March 9, 2016
Between them, Sisense Inc. and AtScale Inc. articulate two similar-but-different takes on a next-gen data warehouse (DW) architecture.
If you're sympathetic, you'd describe what they're delivering as better-than-data-warehouse products. Both claim to offer DW-like systems that minimize the biggest bottlenecks of the traditional warehouse model -- data prep, data modeling, the creation and management of OLAP cubes -- while addressing the advanced analytics use case. What's more, both claim to exploit advanced in-memory and on-chip data processing technologies to accelerate performance.
Sisense is the more established of the two companies. Its flagship offering is powered by an in-memory columnar engine that exploits CPU-specific vector processing and on-chip parallelism (see Note 1 at end of article).
As an in-memory engine, Sisense says it exploits both system memory -- the dynamic random access memory (DRAM), often scaling to 128 GB and beyond, populating each node in a cluster -- and on-CPU cache memory. This latter category describes the Level-1 (L1), Level-2 (L2), and Level-3 (L3) (see Note 2) caches integrated into the microprocessor package itself. These caches can range in size from 32 KB (for L1, per core) to 8 MB or more (for L3, shared among all cores); Intel's high-end Broadwell EP-class Xeon chips come prepopulated with 55 MB of shared L3 cache.
Why does this matter? According to Sisense, its in-memory columnar engine pulls off a trick of sorts: it essentially "pins" data in the CPU cache -- i.e., it prevents the microprocessor from "flushing" data from its on-chip caches. The CPU and its constitutive cores can both write data to and read data from the on-chip caches much more quickly than to physical RAM; among other factors, write and read operations to and from RAM are constrained by the overall throughput of the memory bus.
To the degree that an application performs in-memory data processing and is also cache-aware -- i.e., able to persist chunks of data or queries in system cache -- it's much faster than a conventional in-memory application and orders of magnitude faster than a disk-bound application.
This is as much a function of data transfer speed, expressed in gigabits per second, as of latency, expressed in nanoseconds. The chief advantages of cache are (1) its much higher clock speed (see Note 3) and (2) its lower overall latency. "We've minimized the back and forth between the chip and the RAM," says Jeremy Sokolic, vice president of marketing with Sisense. "We've figured out a way to persist small sets of data into the CPU and we're able to reuse each of the little bits of data and analysis across multiple queries, so as you do more and more queries, we actually get faster."
Sisense achieves this by essentially countermanding the CPU's own prediction or prefetching algorithms. "In essence, the CPUs operating system [i.e., internal logic] optimizes and clears things out [of cache] as it sees fit. By chunking up the data in the way we have and by doing the right mappings, we're able to basically trick the CPU into persisting data [in-cache]," Sokolic explains.
Sisense can't persist everything into on-chip cache, however. That's where system memory -- which can scale to hundreds of gigabytes or even several terabytes in clustered configurations -- comes into play. The unit of analytic work in the Sisense universe is the so-called "elasticube." Unlike conventional OLAP cubes, elasticubes aren't precalculated or preaggregated: in the context of an analysis -- be it traditional ad hoc query or visual discovery -- they "live" in physical memory.
Because they're effectively elastic, elasticubes don't have to be pre-computed or pre-aggregated, Sisense claims. This simplifies both data prep and data modeling, maintains Sokolic, who argues that Sisense's use of on-chip, in-memory caching makes this possible. In other words, he explains, Sisense exposes the equivalent of a "quick and dirty" data prep facility -- basically, a wizard-driven interface that an analyst clicks through prior to accessing a data source -- to accelerate data prep.
The performance Sisense gets from its architecture compensates for the inelegance of the model. Analysts can blend or mash data together from several different sources, creating in-memory elasticubes. "We're not creating a view. We're not aggregating. We don't do that. Our [on-chip] processing is so fast we can give you the granular detail data, but we're able to couple that with [data] visualizations, because what we're providing is a single, end-to-end stack," says Sokolic.
AtScale at the Margins?
Like Sisense, AtScale also claims to exploit in-chip parallelism and vector processing, along with cache-optimized in-memory data processing. Like Sisense, it, too, emphasizes a smarter take on OLAP, via virtual, in-memory cubes. Whereas Sisense sits on top of a columnar database engine, AtScale acts as both a kind of distributed query optimizer and an in-memory OLAP engine.
In the former case, it takes SQL or MDX queries -- which are passed to it using tool-specific SQL or MDX interfaces -- and redirects them to the "appropriate" engines running in the context of Hadoop.
Topologically, AtScale itself doesn't run in Hadoop, stresses founder and CEO Dave Mariani; it sits outside the Hadoop environment -- "at the node edge" is how Mariani puts it -- and decomposes individual queries in order to determine where best to run them. For some queries, it might generate code optimized for Impala, a SQL-on-Hadoop interpreter that's well-suited for interactive workloads; for others, it might generate Spark SQL code optimized for Spark, the increasingly ubiquitous cluster computing framework that runs in Hadoop and Cassandra -- as well as on a standalone basis.
All told, AtScale generates optimized code for a number of in-Hadoop engines, including: Hive Query Language, or HiveQL, a query language for Hive, a SQL interpreter for Hadoop that (increasingly) is used in conjunction with Hadoop 2.0's YARN-native Tez engine; Presto, a SQL interpreter for Hadoop first developed by Facebook Inc. that has the support of data warehousing powerhouse Teradata Corp.; and Drill, an open source implementation of Google's "Dremel" that's promoted by MapR Technologies Inc., in particular. (Dremel is the distributed ad hoc query system that underpins Google's BigQuery infrastructure service.) One thing that distinguishes Drill from Impala, Spark SQL, Hive/Tez, and other Hadoop SQL interpreters is that it's designed for raw and multi-structured data. Drill is an ideal engine for querying against JSON, XML, and Avro files.
That's the thing, says Mariani. No single engine is ideal for any or all workloads -- nor should it be.
"What people don't realize is that there really isn't one engine that works for all workloads, so what we found in using Impala, Spark SQL, Hive-Tez, Drill, and Presto … [is that] we use those engines to do the heavy lifting while our platform sits on the edge node of the cluster, basically intercepting queries and then translating them, pulling them apart, creating aggregates, pinning those aggregates in memory," he says. Mariani claims AtScale's ability to redirect queries to fit-for-purpose engines enhances concurrency, too: "You get concurrency by making sure your queries are short-lived. Our whole strategy is to make sure every query can come back in a couple of seconds or less."
Unlike Sisense, AtScale doesn't have a BI front-end play. Instead, it acts as a SQL- and MDX-native interface between BI or data discovery front-end tools and Hadoop. In contradistinction to Sisense, which deemphasizes the use of aggregates, AtScale explicitly generates aggregates, Mariani says.
Its goal is to support both rapid query-response times (via the use of aggregates) and query detail. It also makes use of the dimensional model -- another encumbrance that Sisense eschews.
"The problem [for most IT organizations] is how can I get a dimensional model so that my users can self-serve? That's why AtScale exists. We believe in the dimensional model because we believe users want to consume data that way, but we don't believe in OLAP technology because we believe it fails to keep up with the complexity of data," he argues.
"We have kind of a ROLAP-style architecture, but we're basically managing the aggregates and performance on behalf of our users. We came up with an architecture that basically emulates Hive for tools that speak SQL and emulates SQL Server Analysis Services for tools that speak MDX. Any of those [BI] tools can use their native drivers. All of these analysts who have Excel can now use it [to query] a Hadoop cluster."
If this sounds familiar, it's because it is. Sort of. In terms of the kinds of problems they're trying to solve, both Sisense's elasticube and AtScale's virtual cubes have analogues, of a sort, in conventional decision support. (Kognitio and Teradata Corp., among others, offer virtual cube-like technologies.) Both Sisense and AtScale invoke ROLAP, the relational OLAP technology MicroStrategy Inc. made famous. The advantage of ROLAP is that it minimizes the work involved in preparing data for access, along with the actual building (and maintenance) of cubes. Cubes don't have to be precomputed or preaggregated but can be generated on-the-fly. The disadvantage of ROLAP was that it was relatively prohibitively difficult and costly to scale.
The Sisense and AtScale approaches use intelligent in-memory caching, CPU-specific optimizations, and general-purpose parallelism to mitigate this issue.
In the same way, a number of BI and data warehousing vendors now market in-memory technologies that exploit both on-CPU features -- such as SIMD parallelism and vector processing -- and on-chip caching. The Vector (nee VectorWise) database from Actian, is one such exemplar: VectorWise not only exploited in-chip vector processing but SIMD parallelism and on-chip in-memory caching. So, too, does SAP AG's HANA in-memory database.
As in-memory becomes increasingly common, more vendors will look to either rearchitect to be able to better exploit it or -- as with SAP and HANA -- develop completely new in-memory optimized applications or databases. For the present, both Sisense and AtScale are two of the earliest out of the gate. They each take a different approach: Sisense, by emphasizing its end-to-end bona-fides -- it's both a query-processing engine and a front-end tool; it bundles metadata management and other data management amenities -- AtScale, by playing up its platform neutrality.
"We're trying to enable enterprises to not change their business intelligence environment. Don't sell another BI tool to the end user [and] don't make them change their habits. If they want to be able to do fraud analytics in Tableau or Qlik, they should be able to use those tools to do it," says Mariani.
"The dimensional model is a great one. It's even better if there's only one of them. We can roll up all of these legacy BI tools that exist in the enterprise [and] allow them to keep doing what they're doing, which is dashboarding and visualization. We can centralize the semantics on AtScale. Those users, whether it's MicroStrategy or Tableau, or Excel, they're all going to see the same view of the data with the same access roles and the same rights and the same level of performance."
 In-chip parallelism comes via Intel's support for single instruction multiple data, or SIMD.
 In Intel's architectural newest specifications, the L3 cache is often called the "last level cache," or LLC. In place of DRAM, L1 cache uses a much faster form of memory called static RAM (SRAM).
 The L1, L2, and L3 cache on a Xeon E5 2699 V4 run at a maximum of 3.6 GHz, compared with 1.066 GHz for DDR4 synchronous dynamic RAM, or SDRAM. DDR is an acronym for "double data rate;" DDR4 designates fourth-generation DDR SDRAM, the fastest SDRAM RAM specification.