Machine-Generated Big Data Poses Special Challenges
Machine-generated data poses challenges that can make it difficult for any relational competitor to be dominant.
- By Stephen Swoyer
- October 23, 2012
Analytic database specialist Infobright is trying to corner the market on machine-generated data. It isn't alone.
Machine-generated data is expected to comprise one of the biggest categories of big data, so challengers abound -- including many of Infobright's established analytic database rivals.
According to some challengers, however, the problems posed by machine-generated data make it difficult for any relational competitor to become dominant.
What's needed, these challengers claim, is an entirely new class of platform.
Infobright Changes Tack
Infobright says it's serious about tackling machine-generated big data. It plans to release its new Infopliance later this year. The analytic appliance scales from 12 TB to 144 TB. With load speeds of up to 10 TB per hour (per node) and an architecture that officials claim is highly suited to storing and analyzing machine-generated data, Infobright says it likes its chances in this fast-growing market.
At the same time, concedes president and CEO Don DeLoach, it still has to get the word out.
After all, he acknowledges, prospective customers tend to default to well-known brands.
"We've concluded that there is a market for an appliance for machine-generated data," DeLoach says. "We've seen customers who select something like a Netezza or Teradata, or even an [Oracle] Exadata to store their machine-generated data."
By "machine-generated" data, Infobright means a potentially enormous category: Web logs, network events, call data records, RFID information, and telemetry data generated by IP-enabled end-points or devices. It's the very stuff of big data.
"[The Infopliance is] a general-purpose, real-world technology for doing something that's very narrow-scope," says DeLoach. "The market had evolved to the point where there is a very definite need for something like this that was fundamentally not being fulfilled [or which was being] filled at much higher cost to [prospective customers]."
DeLoach says Infobright's architecture -- which, although columnar, is distinctly different from columnar competitors such as ParAccel or the former Vertica -- is well-suited for storing and analyzing machine-generated data. "The mathematics behind our offering has some unique advantages for machine-generated data that actually become a limitation when you get into a general-purpose data warehouse.
"The [intellectual property] associated with how we load the data, how we establish the metadata layer; the lack of any kind of administrative overhead, the aggressive hardware compression -- these are all advantages [that Infobright has relative to its] competitors."
DeLoach points to a trio of immediate drivers -- call data records, Web logs, and network events -- the volumes of which he claims are expanding rapidly.
"There's going to be a proliferation of sensor data. As the market matures from a machine-to-machine standpoint, and as we see things like the smart grid market evolve, I think that you will expect to see more and more of Infobright used in these solutions."
Mark Madsen, a principal with consultancy Third Nature Inc., says Infobright's architecture does confer a few advantages when it comes to processing, storing, or analyzing machine-generated information.
"[Infobright does] some very interesting stuff for data placement, which optimizes for storage and I/O [performance]," he comments. "Infobright is designed for write-once data in [a] simple schema, like big flat log records in big flat tables," he continues, adding that -- in many cases -- data of this kind "essentially" comprises an event stream.
"Those kinds of sensor streams you can do fine with getting, organizing, and querying them, but you can't [easily or efficiently] do much math on them," Madsen observes.
A Post-Relational World?
This is true of any relational database system, says Madsen: unless it embeds analytic routines inside the database engine itself -- much as IBM Corp., ParAccel Inc., Oracle Corp., SAP AG, and Teradata Inc. are doing with their data warehouse platforms -- it's at a computational disadvantage relative to other kinds of data stores, such as vector- or matrix-based engines.
That's the rub. The data or event streams generated by sensors, embedded devices, machines, and other types of intelligent "things" tend to be multidimensional. Analyzing this data calls for both spatial and time-series operators. For this reason, some experts argue, it lends itself to a more demanding kind of analytics. Call it computational analytics.
That's one reason Infobright and its relational database rivals aren't the only challengers in this segment.
There's another category of database engine: what might be called the "baggage-free" computational analytic platform. These databases reject traditional relational architectures. Entrants in this class include StreamBase Systems, VoltDB, and SciDB -- all three of which were developed (or co-developed) by industry luminary Michael Stonebraker -- as well as Paradigm4, a big analytics platform based on SciDB.
Stonebraker likes to describe SciDB as a data management and analytics software system (DMAS), an acronym that he uses to distinguish it from the familiar DBMSes that have long anchored BI and DW efforts. SciDB was designed to power the Large Synoptic Survey Telescope (LSST), an ambitious effort to map the Milky Way, among other goals. When it goes live in 2021, the LSST is expected to generate up to 30 TB of data every night.
As a DMAS, SciDB isn't architected like a conventional (relational) repository. It structures information in terms of arrays and vectors, which proponents say makes it ideal for expressing both spatial and time-series operators.
Paradigm4 extends SciDB with optimizations for or enhancements to the R statistical/programming language, MATLAB, and IDL, along with improved support for procedural languages such as C++ and Python. In addition, Paradigm4 offers management tools and other proprietary add-ons, along with maintenance and support.
"The basic idea here is that relational databases have been around for years [and] have a table data model that was designed for business facts, [which means that] you can't take this [machine-generated] data that's inherently ordered and shoehorn it into a relational database without sacrificing performance," says Marilyn Matz, CEO of Paradigm4. "So [Paradigm4's] basic data model is a multi-dimensional array ... [and] the value of that array preserves the inherent ordering [of data]; if you have spatial data, the [data] to the left of you is [logically understood to be] in the position to the left of you, [both] in storage and in the real world."
Paradigm4, Matz says, benefits from SciDB's architecture, which was designed to accelerate certain kinds of complex operations. "A lot of these math operations ... are actually matrix operations ... and a lot of these problems have high dimensionality, so when you're doing discovery analytics or ad hoc querying, you want to be able to slice, dice, drill down ... without having to set up any indices or doing any tuning: you just want access."
It's in this respect, she claims, that a relational architecture is most limiting. "In a relational database, it doesn't matter if you choose 'row-major [order] or column-major [order], you're still not having this [i.e., Paradigm4's underlying] dimensional model," Matz contends. "A relational database has to store indices; we don't. It's like [it is in] programming: you declare the multidimensional array and you compute where the data is."
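Matz's analogy to programming can be sketched in a few lines. This is an illustrative plain-Python example, not SciDB or Paradigm4 code: in a dense array stored in row-major order, a cell's location in storage is computed directly from its coordinates, so no index structure is needed, and spatial neighbors in the data are also neighbors in storage.

```python
# Illustrative sketch (plain Python, not SciDB): a 4x5 grid of sensor
# readings stored flat in row-major order. A cell's storage offset is
# computed from its coordinates -- no index lookup required.
rows, cols = 4, 5
flat = [float(i) for i in range(rows * cols)]

def cell(r, c):
    """Address cell (r, c) by computing its offset into flat storage."""
    return flat[r * cols + c]

# The cell "to the left" in the real world is also the immediately
# preceding cell in storage, preserving the data's inherent ordering.
assert cell(2, 3) == flat[2 * cols + 3]
assert cell(2, 2) == flat[2 * cols + 3 - 1]
```

A relational engine, by contrast, must consult an index (or scan) to find the same cell, because row order in a table carries no spatial meaning.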
Paradigm4 is not a SQL database. It substitutes two languages -- AFL and AQL -- for the RDBMS's traditional dependence on SQL. AFL, Matz says, "looks just like APL," an array-oriented programming language. (APL is an acronym for "A Programming Language;" AFL, on the other hand, stands for Array Functional Language.)
In lieu of SQL, Paradigm4 prescribes AQL, or Array Query Language, which Matz says "people coming from the SQL analytics world are more comfortable with."
Another way in which Paradigm4 (and SciDB) differ from relational databases is in how they handle missing data values, or NULLs. According to Madsen, data management (DM) practitioners will use a number of methods -- e.g., they'll compute a moving average -- to handle NULLs.
For many applications, however, this approach has problems, Madsen says. "I can substitute context easily using just a moving average and [I won't have any] problems in merchandising, but you can't do that in a risk calculation; you have to have a mathematically valid way [of computing a missing value], so you would have to sort of custom-write an algorithm," he comments. "A lot of people can't even use a data warehouse today for machine learning stuff, or basic statistics. In fact, when people are doing hardcore stuff, many times they end up bypassing [the data warehouse] and going to the raw data anyway."
SciDB and Paradigm4 offer users more flexibility in this regard, Matz maintains.
"SQL semantics has one notion of NULL, and that doesn't cut it, so what we have is ... an unlimited number of codes [that you can use] instead of NULL, [because] what people really want to do is context-specific substitution," she explains.
For the risk calculation that Madsen invokes, for example, a user could assign a custom-written algorithm to a specific code, which -- depending on the context -- would be substituted when or where appropriate. "I might be in one kind of use case or query ... where ... if there's a missing value I might want to fill in the spatial average," Matz concludes. "We're able to support multiple flavors of NULLs" by means of substitution codes.
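The technique Matz describes -- multiple missing-value codes, each mapped to its own substitution strategy -- can be sketched in plain Python. This is a hypothetical illustration, not Paradigm4's actual API: the codes and fill functions here are invented, but they show how one gap can be filled with a moving average in one context and a fixed value in another.

```python
# Hypothetical sketch of context-specific missing-value substitution,
# in the spirit of the multiple "flavors of NULL" Matz describes.
# The sentinel codes and fill strategies are invented for illustration.
MISSING_MOVING_AVG = -1.0   # code: fill with an average of neighbors
MISSING_ZERO = -2.0         # code: fill with zero

def moving_average_fill(series, i, window=2):
    """Average the nearest non-missing values around position i."""
    neighbors = [v for v in series[max(0, i - window):i + window + 1]
                 if v >= 0]
    return sum(neighbors) / len(neighbors) if neighbors else 0.0

FILLS = {
    MISSING_MOVING_AVG: moving_average_fill,
    MISSING_ZERO: lambda series, i: 0.0,
}

def resolve(series):
    """Substitute each missing-value code using its assigned strategy."""
    return [FILLS[v](series, i) if v in FILLS else v
            for i, v in enumerate(series)]

readings = [10.0, 12.0, MISSING_MOVING_AVG, 14.0, MISSING_ZERO, 16.0]
filled = resolve(readings)
assert filled == [10.0, 12.0, 12.0, 14.0, 0.0, 16.0]
```

A risk calculation would simply register a different, mathematically valid fill function under its own code, rather than accepting the one-size-fits-all NULL of SQL semantics.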