NuoDB: A Database for the Cloud
There are two big problems associated with data warehousing in the cloud -- agile elasticity and ACID compliance. NuoDB, a database startup, aims to address both of them.
- By Stephen Swoyer
- November 13, 2012
There are two big problems associated with data warehousing (DW) in the cloud: agile elasticity and ACID compliance. NuoDB Inc., a Cambridge, Mass.-based database startup, says it addresses both of them.
Elasticity is cloud-speak for capacity. The promise of the cloud is that it's elastic: it can expand and contract with just a few mouse clicks. Need to add extra processing horsepower or additional storage? Just fire up a new compute or storage resource. Need to scale back? Click, click, click and you're done.
To be sure, it's relatively "easy" to expand capacity in a massively parallel processing (MPP) configuration. In most cases, it's simply a matter of adding additional nodes. This doesn't happen as quickly in the (physical) MPP world as it does in the (virtual) cloud. Given the way in which things get done in the enterprise, augmenting an MPP data store almost always involves planning, budgeting, purchasing, and -- although less of an issue these days -- implementation. Vendors and IT organizations have been working on compressing this process, but in the vast majority of cases, it still isn't "clickable."
That's the point: the cloud is agile in precisely those ways in which the IT status quo is not.
This agility makes the cloud a turbulent atmosphere in which to host a conventional relational database platform, however. In the cloud, database "nodes" can reside on the same physical or virtual systems -- or (more likely) on separate physical or virtual systems. They can access separate (physical or virtual) storage pools. They're subject to the vagaries of (both physical and virtual) network transport. They are, in short, anything but reliably available, if by "available" one means something analogous to the uptime, reliability, and responsiveness of an MPP DW.
This is the main reason it's difficult to scale data warehouses predictably and reliably in the cloud. Addressing that difficulty is NuoDB's raison d'être.
Chief engineer Seth Procter says NuoDB uses a tiered approach -- comprising multiple, redundant tiers of transaction engines (TE) and storage managers (SM) -- to address the elasticity issue.
"The idea is that you have a 'host,' say 'Host A,' that's going to run our NuoAgent, which is just a little management program that plants a flag in the sand and says 'This host is available for use by NuoDB,'" explains Procter. "That immediately makes it available to our management APIs and to our tools [NuoDB Console], so that you can start to see what hosts are available, what resources [they have], and where you can think about running things."
A typical NuoDB "domain" consists of multiple, redundant TEs and SMs. Both can run on the same host. A "host" can likewise be any supported platform -- Windows, Mac OS, Linux (several flavors), Solaris x86, and two cloud platforms (Amazon EC2 and JoyentCloud) are supported -- running on almost any hardware form factor, including laptops.
Client or end-point access to the TEs is facilitated by one or more NuoDB "broker" hosts that make decisions about how to optimize or direct query traffic.
In this scheme, adding capacity is mainly a matter of adding additional brokers, TEs, or SMs, argues Procter. An added bonus is fault-tolerance. There's no single point of failure in the NuoDB model because an entire copy of a database is always available.
"What's really nice about the ... architecture is that each [storage manager] is a completely independent archive of the database, and they 'know' how to automatically synchronize [with one another], so that if you start a second storage manager process, it synchronizes with a live system that's running. When it's ready, it synchronizes live with a running database," he says. "Once it's [up and] running, you then go from running one full copy to two full copies of a database."
Existing cloud databases scale by means of provisioning new "instances" of a database engine. This is wholly different from the NuoDB model, Procter points out: the conventional cloud database basically entails running (and managing) additional copies of a database, much like you'd do were you trying to scale out an MPP data warehouse.
True, this process can be greatly simplified -- even automated entirely -- by means of under-the-covers scripting. True, virtualization helps make this approach a lot more efficient. True, elasticity and latency aren't necessarily an issue in a private or enterprise cloud. However, in most cases, you wind up using a kludge -- a federation layer -- to logically knit together your distributed cloud database instances.
There's something else at stake here, Procter argues. In the public cloud -- in the vision of cloud as a utility computing service -- this idea of scaling-by-spawning is fundamentally a bad fit. It's at once incongruous and anachronistic. The bottom line, he argues, is that the traditional DB-as-a-service or DW-as-a-service model constitutes the transplanting of technology that was designed and perfected for use in a well-defined paradigm (a distributed, inescapably physical client-server topology) into a fundamentally amorphous context: that of an elastic, loosely-coupled, multi-tenanted, inescapably virtual topology.
It's a kind of vivisection, if you think about it: you're taking something that evolved in one context -- the client-server environment -- and attempting to stitch it into another, completely different context.
To put it another way, it's like what happens when you grab a hardcover copy of War and Peace to read on a long flight. By the time the flight's over, your hands and arms are sore from propping up or readjusting three pounds of paper. At that point, you're more open than ever to the idea of getting an iPad, Kindle, or similarly lightweight eBook reader.
NuoDB is an eBook reader for the cloud. "There shouldn't be a notion of having to configure a single master versus a bunch of slaves. Everything is equal, things work independently, they fail independently, they can come up and come down independently," says Procter.
The (Distributed) ACID Test
The second problem NuoDB purports to address is a bit trickier to describe. It has to do with durability, complex queries, and a highly distributed architecture. It also has to do with ACID compliance, which Procter says is NuoDB's big differentiator with respect to the NoSQL crowd: most NoSQL engines eschew ACID compliance in favor of utility or practicality. Mainly, however, it has to do with the challenge of hosting a data warehouse -- as distinct from a vanilla database -- in the cloud.
As a plain database, NuoDB has an intriguing -- and possibly compelling -- story to tell. It claims to be both ACID- and SQL-compliant, which Procter and other officials say makes it a more data management-friendly alternative to NoSQL.
"We've tried to be heavily standards-compliant. We've tried very hard to think about SQL-99 ... and we've been working with a number of our beta customers who ... [are] taking an existing [SQL] database and trying to port existing applications [to NuoDB]," he comments, noting that NuoDB doesn't yet support popular database-specific flavors of SQL, such as Oracle Corp.'s PL-SQL. Procter and his team have likewise had to do a bit of "tweaking" to address some of the quirks associated with RDBMS platforms – particularly MySQL.
"By and large, when a particular customer has come to us and said we need particular [SQL] syntax for particular functionality, we've been able to work with that," he indicates.
So much for SQL. How does NuoDB purport to address ACID in a highly distributed, loosely-coupled context? Probabilistically, says Procter. What matters, he stresses, is that reconciliation ultimately happens. If you deploy a normal distribution of hosts, TEs, and SMs, transactions will get propagated over time (usually a very short period of time). Reconciliation will occur.
"What's going to happen is that as part of a transaction, everyone who cares about the record that I'm updating is going to hear about this update. Eventually, all of the transaction engines that have this object in memory are going to hear about it, so they're going to update their caches, and eventually all of the SMs are going to hear about it, because each one has a full copy of the object," he explains. "In practice, what we're really doing is we're saying, 'I have to get this update out to every transaction engine, although that doesn't have to happen as part of the commit, necessarily, it just has to happen eventually."
This is all fine when it involves simple transactions (a single credit, a single debit). Procter and NuoDB are on less compelling ground when it comes to complex transactions that involve -- for example -- a simultaneous credit and debit to an account.
When industry veteran Mark Madsen, a principal with consultancy Third Nature Inc., asked about this very scenario during a joint briefing with BI This Week and NuoDB, Procter stressed that NuoDB -- in version 0.8 at the time of the briefing -- was still building up to its 1.0 release.
"In order to make sure that this object is really durable, at least one SM has to acknowledge that it has [written] it," he says. "This is one of the things that we're working on trying to figure out how flexible and rich we can make it for our 1.0 release."
There's also the larger question of what kind of data warehouse NuoDB is -- or could be. Right now, Procter concedes, development is focusing more on OLTP, chiefly because that's where most of the pain is, in the cloud environments of today, at least. When it comes to querying -- or to the kinds of complex queries that are common in the DW world -- NuoDB doesn't yet have quite so compelling a story to tell.
"This is a database that's really designed for the kind of relational queries that people have, where there's a mix of reads and writes, where there are going to be some heavier queries [and] some lighter queries. We haven't been trying to optimize yet for some of the more complicated queries," he indicates.
Procter suggests, however, that NuoDB's architecture could confer a notional analytic advantage, too. "One of the things we have been working on ... is we have ... this notion of caching generic objects in a distributed peer system. We have peers that have to know how to coordinate with each other, [which means that they] need to know how to do batch processing," he says. "This kind of batch processing sounds like MapReduce, like applications that take computationally intensive [queries] and split them into smaller queries."
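The split-and-combine pattern Procter alludes to looks roughly like this in Python. It is a generic scatter-gather sketch using the standard library's thread pool, not a description of NuoDB internals. The partition layout and function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_total(rows):
    # "Map" step: each peer totals only the rows it holds.
    return sum(amount for _, amount in rows)


def distributed_total(partitions):
    # Fan the smaller queries out, then combine the partial answers.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_total, partitions))
    return sum(partials)  # "reduce" step


# Hypothetical data: (customer, amount) rows spread across three peers.
partitions = [
    [("a", 10), ("b", 20)],
    [("c", 5)],
    [("d", 40), ("e", 25)],
]
total = distributed_total(partitions)
```

Each peer answers a smaller question about its own rows, and only the partial results travel back to be combined.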
Third Nature's Madsen says he's frustrated with the limitations of existing DW "cloud" services and concedes that he's intrigued by NuoDB: "When I look at cloud databases, I see a lot of people who've created a federated database on top of partitioned MySQL, which is a terrible idea for so many reasons. [NuoDB is] one of the few I've found that's actually designing for a cloud environment to try to solve those problems."
On the other hand, he continues, NuoDB isn't yet ready for the DW prime time. "Most of what [they've] gone over speaks to OLTP, but what about non-OLTP [use-cases] -- for example, query-heavy, query-analysis stuff?"
This aspect of NuoDB is still gestating, he points out. "Lack of a distributed query puts a big dent in the [data warehousing] utility of the database, making it like a NoSQL database, only with SQL that developers [unlike data management professionals] seem not to want," he comments.
Because of this, it's likely that NuoDB can't yet address all query conditions, Madsen suggests.
"Or not," he concludes. "The only way to tell is to try it."