Aerospike Accelerates Specialty Analytics
Upstart player Aerospike is betting there's a market for a highly specialized analytics platform.
- By Stephen Swoyer
- February 10, 2015
Upstart player Aerospike Inc. is betting that there's a market for a highly specialized analytics platform. How highly specialized? Aerospike's NoSQL database -- dubbed, appropriately enough, "Aerospike" -- is optimized for so-called "transactional analytic applications."
Just what those apps are and why they're important is a bit complicated.
"The problem we started out with was that it's hard to build a high-scale app on the Internet that would actually stay up and available. The primary problem in our view is the data-store layer, and we felt that the best solution [to this problem] was to focus on key-value [pairs]," explains Brian Bulkowski, Aerospike's CTO. "In our speak, a 'database,' is really a primary key. If you need to go beyond that [with user joins and other things], three's usually going to be an amorphous penalty that you don't understand. Key-value pair is where predictability is."
What does this mean? Why key-value pairs? What kinds of problems is Aerospike trying to address?
Part of the answer has to do with how advertising analytics works. Starting around 2010, the ad analytics market switched from storing session information on the client (as "cookies") to storing it on both the client and the server. There's a lot more going on in the background, but from a data management (DM) perspective -- and to Bulkowski's point -- most of this server-side session information is written as key-value pairs. Most Web-serving workloads (like most non-analytic workloads, for that matter) are read-intensive. Generally speaking, you're going to be reading from disk (or Flash or memory) far more often than you're going to be writing to it. Not so with server-side sessionization, which requires scalable performance on a roughly 50/50 mix of reads and writes.
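To make that read/write balance concrete, here is a minimal Python sketch of server-side sessionization. It is not Aerospike's API: the `SessionStore` class, its `get` and `put` methods, and the `handle_ad_request` helper are hypothetical names, and a plain in-process dictionary stands in for a distributed key-value store. Each request looks up a session by its key and writes the updated record back, which is why the workload tends toward one write for every read.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Hypothetical in-memory stand-in for a key-value session store."""
    data: dict = field(default_factory=dict)
    reads: int = 0
    writes: int = 0

    def get(self, key):
        # Look up a record by primary key and count the read.
        self.reads += 1
        return self.data.get(key)

    def put(self, key, value):
        # Store a record under its primary key and count the write.
        self.writes += 1
        self.data[key] = value


def handle_ad_request(store, user_id, page):
    """Read the user's session, append the page view, write it back."""
    session = store.get(user_id) or {"pages": []}
    session["pages"].append(page)
    store.put(user_id, session)
    return session


store = SessionStore()
for user_id, page in [("u1", "/home"), ("u2", "/home"), ("u1", "/shoes")]:
    handle_ad_request(store, user_id, page)

print(store.reads, store.writes)
```

Every request in this sketch costs exactly one read and one write, in contrast to a typical page-serving workload where most requests only read.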
"The advertising guys said, 'We need something just like this that does 50/50 read-write workloads, and something that has a baked-in level of reliability.' When they thought about it some more, they [decided they] wanted something that could use shared, attached storage, too," Bulkowski explains, noting that he and co-founder Srini Srinivasan had built just a system in "Citrusleaf," a distributed, fault-tolerant, NoSQL key-value store optimized for Flash (i.e., solid-state disk, or SSD) storage. When they thought more about the problem, however, Bulkowski and Srinivasan realized they'd need more than just a Flash-optimized data-store layer.
"We said, 'We really do think it makes sense to do a little analytics on that front-side store without ETL-ing [it out to an analytics platform], so we built out capabilities to do that," he explains.
Enter "Aerospike," which is a hybrid or melding of the former Citrusleaf Flash-optimized key-value store and an open source software (OSS) database called AlchemyDB. AlchemyDB, which Aerospike acquired in 2012, is based on the popular Remote Dictionary Server, or Redis, an OSS key-value pair data store.
AlchemyDB is much more than just another key-value store, however. As a NewSQL eventual-consistency database, it implements a SQL-like language that permits it to process SQL queries. AlchemyDB can also implement graph functions using both SQL (for indexing) and an open source scripting language called Lua, which is used to express graphing logic. Add it all up and you have a database engine -- Aerospike -- that Bulkowski says can process "real-time analytics workloads for applications that require millisecond or sub-second response times."
"We're the hot data store, the thing on the front side of an application server. There are going to be a whole bunch of databases behind us, and they're going to be used for all sorts of different analytics [i.e., workloads]. If you think about the kinds of queries you want to run on the front-batch [i.e., Aerospike], they're usually pretty simple. They usually don't have a lot of complicated join information. For example, I know we have 30 days of data in that front-end store, but in the last day or two, what happened to this audience or this advertising campaign or this pool of users?"
Think of this as "single-column" (but non-columnar) analytics. Because Aerospike is trying to solve a highly specialized problem, it does things -- or makes assumptions -- that would make traditional DM practitioners uncomfortable. For example, it doesn't do data validation -- at least not on ingest. "There's the data validation portion of schema, and then there's the I-have-to-know-how-to-index-if-I-have-an-index portion," Bulkowski notes, explaining that Aerospike builds secondary indexes and can also index on column values. "For the first side, we don't do data validation on input -- but the second side, which is schema management for the purposes of indexing, we have it. We have a SQL-like tool that maintains a catalog table. There's contention resolution."
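The split Bulkowski draws between validation and indexing can be illustrated with a small Python sketch. It makes no claim about how Aerospike implements secondary indexes: the `IndexedStore` class and its `create_index`, `put`, and `query` methods are hypothetical, and the sketch assumes indexed values are hashable. Records are accepted as-is, with no check on which fields they carry or what types those fields hold; the only "schema" the store tracks is the list of fields it has been asked to index.

```python
from collections import defaultdict


class IndexedStore:
    """Hypothetical key-value store with secondary indexes, no ingest validation."""

    def __init__(self):
        self.records = {}   # primary key -> record (any dict)
        self.indexes = {}   # field name -> {field value -> set of primary keys}

    def create_index(self, field_name):
        # Build the index from whatever records already carry the field.
        index = defaultdict(set)
        for key, record in self.records.items():
            if field_name in record:
                index[record[field_name]].add(key)
        self.indexes[field_name] = index

    def put(self, key, record):
        # Accept the record unchecked; only keep the indexes consistent.
        old = self.records.get(key, {})
        for field_name, index in self.indexes.items():
            if field_name in old:
                index[old[field_name]].discard(key)
            if field_name in record:
                index[record[field_name]].add(key)
        self.records[key] = record

    def query(self, field_name, value):
        # Return the primary keys whose indexed field equals the value.
        return sorted(self.indexes[field_name].get(value, set()))


store = IndexedStore()
store.create_index("campaign")
store.put("u1", {"campaign": "spring", "clicks": 3})
store.put("u2", {"campaign": "spring", "clicks": "three"})  # type is not checked
store.put("u3", {"segment": "sports"})                      # no "campaign" field
store.put("u1", {"campaign": "winter"})                     # overwrite re-indexes

print(store.query("campaign", "spring"))
print(store.query("campaign", "winter"))
```

A record without the indexed field is stored but simply never shows up in queries against that index.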
Aerospike also exposes a SQL query interface. Bulkowski argues that SQL's usefulness is radically underappreciated, at least among traditional or Web application developers. "We think SQL is the most natural way of expressing a lot of different queries, including streaming."
Two Different Visions of the Future
Bulkowski positions Aerospike as a kind of point solution. It's designed to address a very specific problem -- namely, real-time or transactional analytics -- which imposes hard requirements with respect to performance and availability. Web advertising is a good example -- e.g., Aerospike works in the background to figure out which ads to serve up based on a person's browsing history -- but Bulkowski cites similar requirements in utilities, financial services, manufacturing, and other markets.
To this end, Aerospike last year significantly ramped up its sales and marketing, appearing at several industry trade events, including O'Reilly Inc.'s Strata + Hadoop World in New York. BI This Week caught up with Aerospike at last summer's O'Reilly Open Source Convention (OSCon), in Portland, Ore. At OSCon, Bulkowski answered lots of questions about his company's embrace of OSS, which had occurred just one month earlier. "We're pretty happy having satisfied this performance-intensive niche in advertising that we have a code base that's really, really hardened. We want to go wide with this, [and] that requires an open source model," he said, explaining the move.
Aerospike's specialty pitch flies in the face of the irrepressible human demand for an all-in-one fix for all possible workloads (or for all workloads in a related domain). What we're seeing with some takes on the Hadoop platform -- e.g., Cloudera's Enterprise Data Hub vision -- or with megafauna-like systems from Oracle Corp. (Exadata) and SAP AG (HANA) is an articulation of this all-in-one fixation. These and other technologies are still too shortsighted, however, Bulkowski argues. After all, he points out, we're still figuring out what we're going to do with data. How are we to know what the data architecture of the future -- or of 10 years from now, for that matter -- will look like?
Why, then, should BI or data management practitioners care about Aerospike? On the one hand, Bulkowski argues, it's optimized for a problem -- viz., real-time analytics, be it in the context of sales or marketing campaign optimization -- that few other offerings can credibly address. On the other hand, he says, it's a fast, in-memory engine that can accelerate certain kinds of NoSQL and traditional SQL analytics. Insofar as it exposes a SQL interface, it can be accessed and queried by traditional BI tools. Above all, he claims, Aerospike is just one of several critical components of a next-generation data "layer."
His is a vision of a layer of optimized engines for specialty processing and of one-size-fits-most engines for general-purpose processing. Data is vectored to where it needs to go -- to a platform such as Aerospike for certain kinds of transactional analytics, to Hadoop for long-term storage in HDFS, to the data warehouse -- and data movement is itself minimized.
"We're not going up against Hadoop. I think my view of the back-end analytic space is [that] HDFS [i.e., the Hadoop distributed file system] is going to eat the world but that Hadoop is not. Hadoop is just one style, so to speak. Sometimes I want key value [storage] on my ... petabyte data set, and then I go through HBase. Sometimes I want to use something like a Spark-style streaming that's going to mate nicely with HDFS. There's not going to be one query layer and one app layer. There must be one data storage layer because we can't ETL anymore, and it's driving everybody crazy."