Informatica Takes on Big Data

Informatica's PowerCenter Big Data Edition is more than just a Hadoop DI tool: it even includes vanilla PowerCenter.

At the Hadoop World conference in New York, Informatica Corp. announced a new big data-ready version of its PowerCenter data integration (DI) offering, the aptly named PowerCenter Big Data Edition.

Just as PowerCenter itself has evolved into more-than-just-an-ETL tool -- it bundles data profiling, data cleansing, and other features -- PowerCenter Big Data Edition might be called more-than-just-a-Hadoop-DI tool: it even includes vanilla PowerCenter.

According to John Haddad, director of product marketing with Informatica, the new big data-ready version of PowerCenter bundles features such as data profiling, data cleansing, data parsing, and "sessionization" capabilities. PowerCenter Big Data Edition also includes a license for conventional PowerCenter, says Haddad; this permits customers to run ETL or DI jobs in the context -- i.e., in Hadoop or on one or more large SMP boxes -- that's most appropriate to their requirements or workload characteristics.

"It includes the license and [the] capability to run traditional PowerCenter and scale it up on multiple CPUs like an SMP box or on a traditional grid infrastructure," he confirms. "You're not going to use Hadoop for all of your workloads; if you're doing a few gigabytes of structured data on a daily basis and you want it to be processed in near-real time, you would deploy that on a traditional grid infrastructure," Haddad continues. "If the next day, you have 10 terabytes of data and you need extra processing capacity, you can run that in Hadoop."

Accommodating Hadoop

Vendors are accommodating Hadoop in different ways. DI vendors, for example, tend to take either of two approaches.

Some vendors have gone "all-in" on Hadoop and MapReduce, an approach that uses the Hadoop implementation of MapReduce to perform the processing associated with ETL workloads. Open source software (OSS) DI specialist Talend is an example of this approach.

Other vendors have employed an embrace-and-extend approach. DI offerings from vendors such as Pervasive Software Inc. and Syncsort Inc., for example, run at the node level across a Hadoop cluster; they use their own libraries instead of MapReduce, so that a Pervasive or a Syncsort engine, rather than MapReduce, does the ETL processing on each individual Hadoop node.

Informatica's approach is closer to Talend's -- with a key difference. In the context of Hadoop, PowerCenter Big Data Edition -- like Talend Open Studio for Big Data -- uses MapReduce to do its ETL heavy lifting. However, customers also can run non-Hadoop workloads in conventional PowerCenter. (The Big Data version of Talend Open Studio does not include a license for conventional -- i.e., non-Hadoop-powered -- Talend ETL. If you buy Open Studio for Big Data, you're using MapReduce to do your ETL processing.)

"Hadoop is not for all types of workloads and we recognize that. In some ways, the Big Data Edition is elastic. Even if you're doing a big data project, you're clearly going to want [to involve] some of your more traditional [data] sources, too," says Haddad, who adds: "Don't you want one package that can do it all?"

Haddad and Informatica aren't necessarily insisting on an arbitrary distinction. Some critics allege that although MapReduce-powered ETL is a good fit for certain kinds of workloads, it makes for a comparatively poor general-purpose ETL tool.

"[MapReduce] is brute force parallelism. If you can easily segregate data to each node and not have to re-sync it for another operation [by, for example,] broadcasting all the data again -- then it's fast," said industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc., in an interview earlier this year.

The problem, Madsen drily noted, is that this isn't always doable.
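
Madsen's point can be illustrated with a small, single-process sketch in Python. It is not Hadoop code and not anything Informatica or Talend ships; the record layout, field names, and helper functions are all hypothetical. The sketch assigns records to simulated "nodes" by hashing one field, sums amounts on each node independently, and then counts how many records would have to change nodes if the next operation grouped by a different field.

```python
import zlib
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
SAMPLE = [
    {"customer": "a", "region": "east", "amount": 10},
    {"customer": "a", "region": "west", "amount": 5},
    {"customer": "b", "region": "west", "amount": 7},
    {"customer": "c", "region": "east", "amount": 1},
]

def node_for(key, num_nodes):
    # Stable hash, so the same key always lands on the same simulated node.
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(records, field, num_nodes):
    # Segregate records so that each node holds every record for its keys.
    nodes = [[] for _ in range(num_nodes)]
    for rec in records:
        nodes[node_for(rec[field], num_nodes)].append(rec)
    return nodes

def sum_by(records, field, num_nodes):
    # Each node aggregates its own records; no node needs another's data.
    totals = {}
    for node_records in partition(records, field, num_nodes):
        local = defaultdict(int)
        for rec in node_records:
            local[rec[field]] += rec["amount"]
        totals.update(local)  # keys are disjoint across nodes
    return totals

def records_moved(records, first_field, second_field, num_nodes):
    # Records that must change nodes when the grouping field changes.
    return sum(
        1
        for rec in records
        if node_for(rec[first_field], num_nodes)
        != node_for(rec[second_field], num_nodes)
    )
```

Calling `sum_by(SAMPLE, "customer", 3)` is the easy case Madsen describes, because each node can finish without talking to the others. Following it with a sum by `"region"` means repartitioning first, and `records_moved(SAMPLE, "customer", "region", 3)` gives a rough count of the records that would cross the network to do so.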

Haddad acknowledges that most of Informatica's competitors market Hadoop- or Big Data-ready versions of their DI platforms. On the other hand, he insists, PowerCenter Big Data Edition supports both Hadoop MapReduce and conventional ETL. For this reason, and in view of the shortcomings of MapReduce-powered ETL for certain kinds of workloads, Informatica's is the more "flexible" approach, Haddad claims.

"As companies move more of their workloads to Hadoop, you don't want them to go back to the stones and knives of hand coding," he points out, "so we provide the ability to remove hand coding within Hadoop for ETL and things like that. We also make it possible for [customers] to design and build [DI jobs] once and deploy [them] anywhere: on a traditional grid or on Hadoop."
