Informatica Takes on Big Data
Informatica says its new Big Data Management offering addresses the three-fold challenge of integrating, governing, and securing data in tandem with big data platforms such as Hadoop.
- By Stephen Swoyer
- November 17, 2015
It's all but official: we're in the post-relational age.
The latest piece of evidence comes by way of Informatica Corp., which this month announced a new Informatica Big Data Management solution.
The mere existence of a product called Informatica Big Data Management doesn't mean we're in the post-relational age, nor does the distinction that Informatica makes between its "traditional" DI tools and its new big data-oriented DI offering. In toto, however, they're pretty strong indications that we're no longer in -- square, tabular, conventional, relational -- Kansas.
Informatica itself seems to get this.
"Over the last two years, as we've released Informatica [PowerCenter] Big Data Edition, we've kind of come to the realization, talking to customers, talking to market observers ... that [there are these] two worlds: the traditional, existing use cases and the next-generation use cases," said Piet Loubser, vice president of platform product marketing with Informatica, during a recent analyst call.
"There's obviously a ton of need still on the existing side ... but there's [a] growing [need] on the other side as well where some existing capabilities are needed, meaning the realization that there's more integration and quality and things that we do [well] needed on that side."
Another way of looking at this is that adopters are no longer willing to give the non-square world of big data a pass when it comes to issues of consistency, reliability, security, and governance.
In other words, says Mark Madsen, a research analyst with IT strategy consultancy Third Nature, data management is coming to big data. It's about time, he argues.
"This is probably the biggest thing you can take away from [Informatica's] announcement. It tracks to what's happening in the market. Just look at [this year's] Strata + Hadoop World [conference] and the sheer number of companies -- including all of the major Hadoop distros -- that were emphasizing basic data management capabilities," Madsen points out. "The subtext of this is that data management is coming to the big data environment. It's no longer about making the [big data] environment actually work anymore. That's what it used to be. The focus used to be on monitoring, scheduling, workflow -- all stuff that makes the environment work. That's what I think is the smartness of Informatica's timing. In essence, they waited until the [big data] platforms are more or less functional.
"Now [people] want to do stuff on them. Now it's becoming a data management problem."
First Up: A Metadata Management Repository for Hadoop
Informatica says Big Data Management addresses the three-fold challenge of integrating, governing, and securing data in tandem with big data platforms such as Hadoop.
Informatica claims that it simplifies the task of integrating data at big-data scale by exposing a visual, point-and-click, drag-and-drop workflow-building environment, as well as by packaging some of its existing assets, such as its library of prebuilt transformations, and optimizing them for use with MapReduce, Tez, Spark, and other big data processing engines.
"One of the first things we believe we can bring to [big data management] is a visual paradigm for designing these kinds of environments: your mappings, your rules, everything can be done in a visual environment. We can provide templates out of the box that will make it a lot quicker," he observes. "There are hundreds of transformations that we've been building over the years ... [transformations] that will help you to change the shape of data, change the shape of a column, [and] extract [the] value that you want."
Hadoop has serious shortcomings as a data management platform, starting with its inability to manage metadata and track lineage as data is ingested into and changed within the Hadoop environment. Nor does straight-from-Git Hadoop provide any means to map raw data to the source systems or processes that first produced it, or to sample and profile that data. To this end, Informatica Big Data Management provides a business glossary, along with visual data profiling and data quality capabilities. "[Metadata management] is probably one of the biggest issues especially here because so little metadata exists and so little descriptions exist of what the data ... entails, so providing a glossary environment on top of this is pretty critical," he says.
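What "sampling and profiling" amounts to in practice can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Informatica's profiler: tally nulls, distinct values, and top values per column over a sample of rows, the raw material a glossary or quality assessment would draw on.

```python
# A minimal profiling sketch: per-column null ratios, distinct counts, and top values
# computed over a sample of rows from a CSV file (file name is hypothetical).
import csv
from collections import Counter, defaultdict

def profile_sample(path, sample_size=10_000):
    stats = defaultdict(lambda: {"nulls": 0, "values": Counter(), "rows": 0})
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            if i >= sample_size:
                break
            for col, val in row.items():
                s = stats[col]
                s["rows"] += 1
                if val in ("", None):
                    s["nulls"] += 1
                else:
                    s["values"][val] += 1
    return {
        col: {
            "null_ratio": s["nulls"] / max(s["rows"], 1),
            "distinct": len(s["values"]),
            "top_values": s["values"].most_common(3),
        }
        for col, s in stats.items()
    }

# Usage (hypothetical file):
# print(profile_sample("web_clicks.csv"))
```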
Hadoop's shortcomings aren't limited solely to metadata management. If it's difficult to manage metadata and all but impossible to track lineage in the Hadoop environment, it's at least as difficult to monitor and validate data quality in Hadoop and similar platforms, Loubser says.
The irony, he argues, is that quality variance is a potentially huge problem in a big data context. Not only are you "managing" data of different shapes and sizes in a single platform, but you can't reasonably expect to apply the same quality expectations to all of this data. (Data derived from social media or Weblogs is going to be of a fundamentally different "quality" than is relational data from OLTP systems.) Nor can you expect to use all data in the same ways -- e.g., if it contains potentially sensitive information. "What is the meaning of quality in the big data world?" he asks.
"Obviously in a traditional transaction [processing] system, it's kind of easy [to establish quality] because we have these things built in[to the application or database itself]. If it's financial data, we have all kinds of rules and stuff relating to it, [for example], so how do we deal with things in the big data world and how do we deal with the varying shades [of data] quality?"
Big data platforms, particularly Hadoop, have taken their lumps for giving short shrift to security. There isn't much that any third-party product can do to secure data if the underlying platform on which it's hosted isn't appropriately hardened. Loubser acknowledges as much. At the same time, he argues, a big part of the problem is identifying and classifying potentially sensitive data. That's something third-party technology can help with.
He cites Informatica's "Project Atlantic," a technology that it announced (back in May) to help automate the profiling and classification of data sets. "It starts with first of all discovering and understanding and classifying where sensitive data might exist, so it goes," says Loubser. "Next, [security can be addressed by] understanding where the data is being copied to, understanding proliferation, assessing the risk associated with this."
Informatica, like most of its established competitors, offers data masking technology as part of its traditional DI stack. At their best, masking technologies can identify, redact, or abstract (using hashing algorithms or other technologies) potentially sensitive information, either in situ (when data is extracted into a repository for dev-testing) or in-flight (when data is accessed in real time). Loubser says Informatica's new offering brings some of these capabilities to big data platforms. "[It supports] fixing or masking as you need [it], whether it be persistently masking it in your database in the storage environment or whether it's dynamically masking when data moves all over the place."
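The two modes Loubser describes, persistent masking of data at rest and dynamic masking on read, can be sketched in a few lines of Python. This is a generic illustration using a salted hash and simple redaction, not Informatica's masking engine; the salt handling is an assumption.

```python
# A minimal masking sketch: irreversible salted hashing for persistent masking,
# and partial redaction for dynamic (read-time) masking.
import hashlib

SALT = b"replace-with-a-secret-salt"  # assumption: salt kept outside the data store

def mask_persistent(value: str) -> str:
    """Irreversibly replace a sensitive value before it lands in storage."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def mask_dynamic(value: str, visible: int = 4) -> str:
    """Redact on read: show only the last few characters to the consumer."""
    return "*" * max(len(value) - visible, 0) + value[-visible:]

ssn = "123-45-6789"
print(mask_persistent(ssn))  # fixed-length digest stored at rest
print(mask_dynamic(ssn))     # '*******6789' returned to an unprivileged query
```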
Informatica's profiling technology can also be used to generate dynamic mappings, he claims.
"[This is] the notion that when you're building a mapping in real time based on profiling data, based on patterns, we can detect from the metadata that we [had] previously collected -- we can dynamically do things and shape your mappings for you. If there's, for example, a sensitive piece of data, I can automatically insert and recommend a masking routine for you," Loubser explains.
A Work in Progress
Three years ago, Informatica announced its PowerCenter Big Data Edition. That product was more or less what it sounds like: a version of Informatica PowerCenter that used Hadoop's MapReduce engine to do its ETL heavy lifting. At the time, the Hadoop environment was still tightly coupled to MapReduce. In late 2013, Hadoop 2.0 shipped with a new resource manager, YARN, that broke this dependence. PowerCenter Big Data Edition also included a limited license for Informatica's conventional (non-Hadoop) PowerCenter product. In this way, customers could use either a conventional SMP server kit or Hadoop MapReduce to perform their ETL processing.
Perhaps PowerCenter Big Data Edition was a stopgap, something designed to buy Informatica some time to develop a full-fledged platform such as Informatica Big Data Management. Perhaps Informatica, like many other vendors, didn't know what to make of the whole Big Data Thing, of new platforms such as Hadoop and MongoDB (just to name a few) or of new competitors -- such as Cloudera, Hortonworks, and MapR -- that seemed to be trying to usurp its place in the market.
Perhaps it was a little bit of both. More likely, however, it was something else. According to Madsen, Informatica Big Data Management is reminiscent of the transition -- starting 12 to 15 years ago -- from basic ETL tools to full-fledged data integration suites. The latter incorporated metadata management, data lineage, data profiling, data quality, and master data management (MDM) capabilities. "This reminds me of the era where we went from ETL tools to data integration suites or platforms. There's almost the same message, just played out in a slightly different IT environment," says Madsen. "It makes sense. I mean, we were talking about a bunch of changes to the industry at that time in the data management sector. Now it's more of a broader IT thing.
"Back then, data management came to [the way we practiced] data integration. Now [data management is] coming to the big data platforms. Like I said, [Informatica] timed this very well."