Data Integration and Analytic Heterogeneity
If you want to understand data integration in an age of analytic heterogeneity, you must follow the process: process movement, not data or workload movement, is where it's at.
- By Stephen Swoyer
- April 8, 2014
When it comes to data integration (DI), the industry's fixation on data movement misses the point, says Rick Glick, vice president of technology and architecture with Actian Inc.
"I think that we need to get beyond looking at integration as [a question of] just data flowing between systems and start talking about process movement, [which means] driving processes to the right place in your environment," says Glick, who was CTO for analytic database specialist ParAccel Inc. "There's lots of interesting engines with lots of interesting capabilities, and in most cases, you're going to want to use best-of-breed -- the best [engine] for the purpose. This may or may not be what vendor [with which] you've spent several million dollars wants it to be."
As Glick sees it, this has to do with the heterogeneity of advanced analytics, which requires more of DI than did traditional business intelligence (BI) and data warehousing (DW).
Data integration is an enabling technology for both BI and analytics. DI for traditional BI is a relatively straightforward proposition: its focus is the DW, which is also, in most cases, its terminus. BI and analytic discovery use cases change this somewhat, but they still work almost exclusively with SQL or semi-structured data. DI for advanced analytics is a very different proposition, however: advanced analytic processes tend to consist of multiple analytical workloads and to mix traditional structured (SQL) data with multi-structured data from semi-structured (machine logs, event messages), semantic (texts, e-mail messages, documents, blog postings), and file-based (audio and video files, etc.) sources. All of this data must somehow be staged, transformed, and prepared for initial analysis, which -- by definition -- is itself a mere prelude to additional analysis.
Hence the emphasis on process: from a DI perspective, the data feeding an analytic process will be staged, transformed, and moved multiple times, for multiple kinds of analysis, usually (but not always) with the goal of producing smaller and more refined data sets. Movement is a part of this, but not the most important part; in traditional BI, by contrast, DI is its own discrete process.
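To make that multi-step pattern concrete, consider a minimal sketch in Python -- the file name, fields, and threshold below are hypothetical -- in which each stage consumes the previous stage's output, applies a transformation or analysis, and hands a smaller, more refined data set to the next.

```python
# A minimal sketch of staged, progressively refined analytic data preparation.
# File name, columns, and thresholds are hypothetical.
import pandas as pd

def stage_raw_events(path):
    """Stage semi-structured machine logs (JSON lines) into tabular form."""
    events = pd.read_json(path, lines=True)          # raw, wide, noisy
    return events[["user_id", "ts", "event_type"]]   # keep only what's needed

def transform_sessions(events):
    """First analysis pass: aggregate raw events into per-user activity."""
    return (events.groupby("user_id")
                  .agg(event_count=("event_type", "count"),
                       first_seen=("ts", "min"),
                       last_seen=("ts", "max")))

def refine_for_modeling(activity):
    """Second pass: a smaller, model-ready subset feeds the next analysis."""
    return activity[activity["event_count"] > 10]

refined = refine_for_modeling(transform_sessions(stage_raw_events("events.json")))
```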
"You see people rushing to put SQL interfaces on other databases, to make it easier to get at [access, manipulate, move] data, but this kind of misses the point: it's not really about the language, although I do think it's better to not have to have hoards of [Java] programmers and [be able] to get a little leverage from the SQL ecosystem, but it's not really about language, it's not really about making it easier to get at the data -- it's about having a diverse set of capabilities [on different systems] working together to make it easier for data to flow [between systems] as part of process."
Automating the Process
When Actian acquired ParAccel last April, Glick got a chance to make this vision a reality. He points to Actian's acquisition of the former Pervasive Software Corp. in early 2013, which gave Actian best-of-breed DI technology. In DataRush, Pervasive had developed a DI and analytic technology that could run natively (as a parallel processing engine) across the Hadoop distributed file system (HDFS), as well as on its own -- i.e., as a traditional ETL platform, albeit one able to scale linearly across dozens of processor cores in large SMP system configurations.
DataRush, which Actian has rechristened "DataFlow," takes this one step further by embedding in-process analytics in its DI routines. This was a step in the right direction, argues Glick.
"DataFlow is kind of an interesting piece because it has a bunch of data mining algorithms, a bunch of transformational algorithms, and a bunch of connectivity to a variety of data sources. However, the most important part of this is that [DataFlow] can reside on the same node where that data is sourced," he points out. To illustrate what he means, Glick uses a "bogus" -- i.e., oversimplified -- example involving a Cassandra data store.
"Let's say I'm doing something inside of Cassandra that's a bit OLTP-ish in nature, because Cassandra is really good at that kind of stuff, but I want to take that and do a regression [analysis]. Dataflow allows us to read from Cassandra and do a regression in the same physical platform, then we can take that and join it with some things going on in Matrix."
Ideally, Glick explains, all of this would happen automatically: the process itself is automated such that workloads get scheduled and kicked off (in ordered sequence) on separate systems, data gets moved from one platform to another at the right time, and so on. Eventually, a subset of the data in Cassandra gets moved to -- or persisted in -- Matrix, which is what Actian now dubs the former ParAccel massively parallel processing (MPP) database. From Actian's perspective, Glick says, it could just as easily be moved (or "flow") to an Oracle, IBM DB2, or Microsoft SQL Server DW, too.
"In this example, I'm driving dataflows in Cassandra, Dataflow, and Matrix and I'm able to move around a minimum set of data to give me an answer. The process is automated and has fewer moving parts," he explains. "This is a far different story from, 'I do some ETL work in Cassandra and then pipe [that data] into R, where I do a regression, then I take the regression and ETL that into another platform.' A lot of work today is focused on doing these sort of spoke integrations, but not really pushing process along. Process is subordinated to architecture."
This is still just an ideal vision, Glick concedes: ParAccel's been a part of Actian for just under a year, Pervasive for slightly more. Actian is still fitting pieces together, engineering and coding, and so on, says Glick -- but intra-process data flow of this kind is the ultimate goal. It even has an irresistible logic: run the constitutive parts of an analytic process where it's most cost-effective to do so -- with "cost" understood as a function of processing and storage requirements, data movement, and, of course, time. "Today, it's a hybrid. Software will figure out some of this for you, tools will figure out some of this for you, but the person using the tools will have to figure out most of it," he comments.
"The plan is for the software to ultimately be smart enough to do this for you. You'd say, here are my interfaces, fire off a job, and the system will figure out the best way to make that work, based on the typical cost and actually based even on the platform costs of the [constitutive] workloads."