What "In-Memory" Truly Means
In-memory database engines seem to be everywhere. Trouble is, few so-called "in-memory" databases actually run completely in memory.
- By Stephen Swoyer
- May 21, 2013
[Editor's Note: This report has been updated. See note at end of article.]
From major RDBMS vendors to prominent analytic database specialists to niche players, in-memory database engines seem to be everywhere.
That's part of the problem. According to the textbook definition, an "in-memory" database is a database that lives entirely in physical memory. Two classic examples are the former Applix TM1 OLAP engine (now marketed by IBM Corp.) and Oracle Corp.'s TimesTen in-memory database. Both databases run in (and are constrained by) the physical memory capacity of a system. A decade ago, for example, the TM1 engine was limited to a maximum of 4 GB on 32-bit Windows systems. (TM1 was also available for 64-bit Unix platforms; this enabled it to scale to much bigger volumes. OLAP itself is something of a special case: most OLAP cubes run or "live" in physical memory.)
Going by this definition, few so-called "in-memory" databases actually run completely in-memory; what they do, instead, is optimize for memory usage. Analytic database specialist Kognitio, for example, positions itself as an "in-memory" database; even though Kognitio extensively optimizes for memory usage, it does not run entirely in physical memory.
Industry veteran Mark Madsen, a principal with consultancy Third Nature Inc., sees a trend.
"The market is hyping in-memory as some panacea. It's the 2013 craze, and it's as bad as big data, but with less substance. [Oracle] Exadata adds flash cache, [so] it's suddenly [an] in-memory [database]. Any cache layer anywhere qualifies," he comments, noting that -- on these terms -- any and every database becomes an "in-memory" engine. A database that "optimizes" for processor cache memory, for example, could likewise claim to be "in-memory."
The same can be said for a database platform that incorporates PCI-E flash cache at some level, or which uses solid state disks (SSD) to complement conventional fixed disk drives (FDD).
Call it the cynical appropriation of in-memory technology.
Madsen contrasts this approach with that of a vendor such as Kognitio, which he says uses a "proper model" for memory optimization. Although the Kognitio database isn't a classic "in-memory" engine, it does make clever and effective use of memory to optimize performance. "I think the entirely [in] main memory definition is the best one, so that's what I use. Otherwise, it's memory optimized, like Kognitio, where it's smart about it, uses it wisely, carries out all operations in memory, but isn't 100 percent memory pinned," he explains.
In-Memory in Practice
A good example of a modern in-memory database technology is SAP's HANA platform. HANA can scale to support up to 80 processors and 1 TB of memory in a single system image. HANA actually implements three in-memory engines in a single platform: a columnar engine, optimized for SQL-driven analytics; a NoSQL-like engine for unstructured data; and a text analytic/graphing engine. SAP says it architected for in-memory because of the crippling I/O bottlenecks associated with conventional (FDD or even SSD) storage architectures. Because an in-memory database keeps an entire copy of a database or data set in physical memory, it doesn't have to load or access it from disk.
In theory, this makes it orders of magnitude more responsive than a conventional database, which is able to store just a portion of a database in physical memory; for the rest, it must read from and write to physical disk. As a result, in-memory access times are measured in nanoseconds, or billionths of a second; the access times of even the fastest physical disk devices are measured in milliseconds -- i.e., thousandths of a second.
It's true that solid state disks (SSD) are much faster than physical fixed disk drives, but they're nonetheless much slower than physical memory. It's the difference between flash memory -- with its slower read and write performance -- and the much faster dynamic random access memory (DRAM).
In-memory is almost invariably paired with another architectural innovation: a column store engine. It's a no-brainer combination: a columnar architecture stores data column by column, instead of row by row as in a conventional database; because the similar values in a column compress well, it's able to achieve compression advantages of an order of magnitude (or more) over conventional databases. Think of it this way: it's possible to compress 10 TB of data sitting in (for example) an Oracle data warehouse down to 1 TB or less in a HANA system. This means HANA could run an entire copy of that same Oracle DW from physical memory.
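The compression advantage can be sketched in a few lines. The "region" column below is hypothetical, and run-length encoding is just one simple scheme (real columnar engines combine several), but it shows why storing a column contiguously pays off:

```python
from itertools import groupby

# Hypothetical warehouse column: a million values of a low-cardinality
# "region" attribute, laid out contiguously as a column store would keep it.
region_column = ["EMEA"] * 600_000 + ["APAC"] * 400_000

# Because similar values sit next to one another, a simple scheme such as
# run-length encoding collapses the column dramatically.
def run_length_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

encoded = run_length_encode(region_column)
print(f"{len(region_column):,} values -> {len(encoded)} runs: {encoded}")
```

In a row store, those same values would be interleaved with every other field in each record, so runs this long never form and the encoding buys far less.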
Best of Both Worlds?
A big problem with a textbook in-memory database implementation is that it's memory-bound, so it's constrained both physically -- i.e., by the maximum amount of memory that can be stuffed into a single system (or which can be used to populate all of the nodes in a cluster) -- and economically: physical memory is considerably more expensive, per gigabyte, than disk-based storage. Compression can alleviate this issue, but it can't eliminate it.
HANA, for example, is designed to run on beefy systems with lots of memory; even in an age of dirt-cheap servers, this means it's priced at a premium.
This is one reason Teradata Corp. opted for a memory-optimized approach with Teradata Intelligent Memory, which will be available this month as an option for version 14.10 of the Teradata Database. Intelligent Memory continuously monitors data usage to optimize for performance. In this scheme, frequently accessed data gets loaded into physical memory; the Teradata database tracks access and usage on an ongoing basis, copying "hot" data into (or purging it from) memory as needed.
This approach is similar to that of another Teradata technology -- Virtual Storage -- that first shipped four years ago. Teradata Virtual Storage uses flash and SSD cache tiering to accelerate performance: "cold" data gets stored in the conventional FDD tier; "warm" or "hot" data uses the faster SSD and flash cache tiers.
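Teradata hasn't disclosed the internals of Intelligent Memory here, but the general idea -- count accesses per block and pin the hottest blocks in the fastest tier -- can be sketched as a toy policy (all names below are illustrative):

```python
from collections import Counter

class TieredStore:
    """Toy usage-driven tiering: keep the N most-accessed blocks in memory."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.access_counts = Counter()
        self.memory_tier = set()

    def access(self, block_id):
        self.access_counts[block_id] += 1
        self._rebalance()
        return "memory" if block_id in self.memory_tier else "disk"

    def _rebalance(self):
        # Promote the hottest blocks; everything else stays on disk.
        hottest = self.access_counts.most_common(self.memory_capacity)
        self.memory_tier = {block for block, _ in hottest}

store = TieredStore(memory_capacity=2)
for block in ["a", "a", "a", "b", "b", "c"]:
    store.access(block)
print(sorted(store.memory_tier))  # the two most frequently accessed blocks
```

A production system would add decay (so yesterday's hot data cools off), batch its rebalancing, and extend the same policy across flash and SSD tiers, as Virtual Storage does.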
"We still believe in the traditional data warehouse," says Sam Tawfik, product marketing manager with Teradata. Teradata has dozens of customers with petabyte or multi-petabyte warehouse implementations, Tawfik explains: it would be physically impossible to run these systems entirely in system memory.
"The reason that we're doing this is that we don't think you're going to be able to fit all of your data in memory. It wouldn't make sense to say, 'We'll only take a subset of your data and that will be sufficient,'" he continues. "This [approach] lets them continue to leverage the power of that detail data but also take advantage of technologies like Intelligent Memory while still having access to all of that data."
Update from the author (5/22/13):
This article originally included the following paragraph:
Elsewhere, business intelligence (BI) discovery tools such as QlikView from QlikTech Inc. and Tableau from Tableau Software Inc. describe themselves as "in-memory" technologies; neither runs in (or is constrained by the capacity of) physical memory -- although both do claim to make use of available memory to optimize for performance.
The wording in this paragraph is not only misleading but wrong. The error stems from a claim that I failed to sufficiently develop concerning what's meant by the term “in-memory.”
The comparison with which I opened the article was to textbook in-memory technologies: databases (i.e., data and indexes) that run entirely in memory, and which optimize for memory use (by exploiting all three levels of processor cache, for example). The aim of the article was to illustrate the extent to which in-memory is being appropriated by a large number of vendors.
This is particularly the case in the database arena, where it matters most. An in-memory database means something very specific to many technology professionals. It means something else, I think, to many marketing pros, and to many customers. This isn't necessarily a bad thing. It's the engine of market-driven conceptual change at work.
The invocation of Tableau and QlikView was an inapposite comparison. It wasn't essential to my argument, even though it does have a kind of salience.
Try looking at it another way. If the standard is simply that Tableau and QlikView optimize for memory usage and performance -- which I concede that they do -- how are they any different from a desktop Excel tool, running on a system without a disk cache (i.e., without virtual memory)? By the ever-loosening standards of what makes “in-memory” in-memory, Excel qualifies as an in-memory tool, too. My Excel tool might hit a wall (as a function of software or physical memory limitations) at 6 GB, 14 GB, or 30 GB, assuming that the operating system still insists on allocating to itself a minimum of 2 GB of physical memory. Can I consume a 500 GB data set in Tableau or QlikView? What about 1 TB? 10 TB? All in one take? Will my experience be entirely “in-memory?”
The aim of the article was to explore the extent to which the term “in-memory,” like analytics, is being appropriated in nontraditional or unorthodox ways. I failed to do that. I would like to thank the TDWI readers who pointed out my error.