Pervasive Software: Big Data-ready, Big Data-willing, and Big Data-able

When you think of movers and shakers in the world of Big Data, data integration (DI) stalwart Pervasive Software Inc. may not be the first name that comes to mind, but that could change soon, thanks to a pair of solutions.

Pervasive -- which used to be known as Data Junction -- says it isn't just Big Data-ready and Big Data-willing. It's Big Data-able, too.

Dave Inbar, senior director for big data products at Pervasive, argues that the company has been Big Data-able for some time. In fact, he says, Pervasive first began kicking the tires on DataRush, its massively parallel DI technology, almost a decade ago, before "Hadoop" was a household word.

Pervasive (then Data Junction) was working with Yahoo -- which (famously) helped develop Hadoop -- in an effort to address some of the Internet giant's then-enormous data management problems.

"[They said] 'We have these rather big log files and we want to be able to join them and do things with them,'" says Inbar. Once Pervasive got wind of Yahoo's projected data volume sizes, to say nothing of its time-to-process requirements, "it became pretty clear to them and to us that the then-technology approach wasn't going to work," he continues.

"We set up a separate technical team, recruited some specialists in parallel processing, and said, 'Go write the next-generation integration engine for us.'"

That's how Pervasive bills DataRush, which Inbar says was designed to scale up -- linearly, across the increasingly large SMP processor configurations that now ship with commodity servers -- and to scale out, in a massively parallel configuration.

That perspective likewise informs Pervasive's strategy with RushAnalyzer, a new Big Data analysis tool it unveiled in February. RushAnalyzer mixes a drag-and-drop design interface with a scripting engine that supports JavaScript and Python. Aside from its scripting component, Inbar positions RushAnalyzer as a no-code-and-go analytic tool for data scientists.

Make that a no-code-and-go massively parallel analytic tool: Inbar says RushAnalyzer effectively parallelizes data mining and predictive analytic processing across a Hadoop cluster. "It is purely the world's first predictive analytics visual tool that runs natively on Hadoop," he says.

In this regard, Inbar positions RushAnalyzer as a replacement for MapReduce. It sounds like a self-serving -- if not quite heretical -- proposition, and it is; Inbar concedes as much. As he sees it, though, MapReduce isn't the solution most folks think it is.

Big Data, No MapReduce

For starters, Inbar says, the market -- and journalists -- need to stop conflating Hadoop and MapReduce. It isn't just that they aren't the same thing; it's that the association of Hadoop and MapReduce, which arguably helped fuel Hadoop's rise to prominence, might now be holding it back. Inbar says Hadoop is "a beautiful platform for all kinds of computation" because it addresses several long-standing problems, including "the data distribution [problem], the coarse-grained parallelism problem, and distribution of computation problem."

Inbar -- who isn't afraid to be provocative -- uses particularly loaded language in describing MapReduce and its relation to Hadoop. "MapReduce is, indeed, a chain and shackle in many ways because it forces you to define your compute solutions in particular ways, including shuffling a lot of data around in intermediate systems," he explains.
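
To make the complaint about intermediate data concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce pattern in Python. It is an illustration only: it does not use Hadoop's API, it is not Pervasive's code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log records are invented for this example. The point is the middle step: every mapped pair has to be regrouped by key before any reducing can start, and on a real cluster that regrouping is what moves data between machines.

```python
from collections import defaultdict

# Hypothetical sample data: "<user> <bytes>" log records, invented for illustration.
LOG_LINES = [
    "alice 120",
    "bob 300",
    "alice 80",
    "carol 50",
    "bob 25",
]


def map_phase(lines):
    """Emit one (key, value) pair per input record."""
    for line in lines:
        user, size = line.split()
        yield user, int(size)


def shuffle(pairs):
    """Group all values by key.

    Here this is a dictionary in memory; in a distributed setting this is
    the step that ships intermediate data across the network.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(lines):
    """Run the three phases in the fixed order the paradigm requires."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

Any computation expressed this way has to fit the same three-step shape, which is the constraint Inbar is describing.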

The next version of Hadoop will actually deemphasize MapReduce. "The guys working on the next version of Apache Hadoop will release many improvements in the next release," he explains, referring to Apache YARN (Yet Another Resource Negotiator) as a case in point. With YARN, Inbar says, the Apache Hadoop team proposes to "decouple the MapReduce coding paradigm from the Hadoop data management and distributed compute infrastructure."

This is a Good Thing, according to Pervasive and Inbar. It should likewise be a Good Thing for any vendor or service that wants to run on top of Hadoop.

Running on top of Hadoop is something that several existing DI products -- including both DataRush and DMExpress, an ETL engine from Syncsort Inc. -- already claim to do.

In the next release of Hadoop, Inbar claims, "DataRush just becomes another compute-paradigm engine that's managed and visible to everything else in Hadoop, but it gives you [DataRush's] pipelining and parallelism benefits inside the Hadoop infrastructure."

If Inbar is provocative on the subject of Hadoop and MapReduce, he's positively iconoclastic when it comes to a mainstay of predictive analytics -- sampling.

In the Big Data world of today and tomorrow, Inbar says, there's no excuse for sampling.

"The traditional approach to analytics was to say, 'It's too expensive to [process] all of the data, let's just [process] a sample and go from there,'" he explains. "Data science is inherently diminished if you continue to make the compromise of sampling when you could actually process all of the data.

"In a world of Hadoop, [of] commodity hardware, [of] really smart software, there's no reason [not to do this]. There were good economic reasons for it in the past, [and] prior to that, there were good technical [reasons]. Today, none of [those reasons] exists. [Sampling] is an artifact of past best practices. I think its time has passed."