Exeros Touts an Rx for Metadata Babel
How can companies automate the heavy lifting of metadata integration and entrust the rest to domain experts who really know the data?
- By Stephen Swoyer
- May 17, 2006
Do you know where your metadata is? More importantly, do you know what your metadata is? It’s a surprisingly tricky question, even in an age of meta-metadata elixirs such as customer data integration (CDI), master data management (MDM), or reference data management (RDM).
It’s a problem for which there are no easy answers, acknowledges Alex Gorelick, founder and CTO of data integration and meta-metadata management specialist Exeros Inc. But it’s not insoluble. There are things companies can do, and technologies they can tap (such as, Gorelick argues, Exeros’ DataMapper tool), to help them separate the wheat from the chaff—or the customer information from the customer disinformation—and vice-versa.
Gorelick knows whereof he speaks. A decade ago, he co-founded the former Acta Technology Inc., a prominent ETL and SAP integration specialist which was acquired by Business Objects SA nearly four years ago.
He says the metadata reconciliation issues which today confront organizations are substantive and pervasive, so much so that metadata management tools, and MDM, RDM, or product information management (PIM) technologies are merely stop-gap measures for dealing with the problem.
“What I realized is that a huge problem in the field isn’t really moving the data, it’s figuring out what to move and how to move it,” he comments. “People are spending all of their time struggling with the data and trying to figure out what does the data mean. Most of [their solutions are] very, very ad hoc.”
Consider metadata labeling tools, for example: “You’d have a bunch of data experts, domain experts, and business analysts together in a room for six months arguing about labels, but labels are misleading,” Gorelick argues.
He cites the example of credit card transactions, which—because of resource considerations, programming constraints, changing company policies, or merger and acquisition activity—are rarely uniformly formatted: “A lot of credit card transactions have overloaded fields. [For example,] if you have an online transaction, you might use the zip code [field instead] to have the URL of the Web site, because it’s online and there isn’t a zip code.”
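The overloaded zip code field Gorelick describes is the kind of thing a simple value-level check can surface. What follows is a minimal sketch, not anything Exeros has described: it assumes a US-style ZIP format, and the function name `profile_zip_field` and the sample values are hypothetical.

```python
import re

# Assumption: the field is supposed to hold US ZIP codes (5 digits, optional +4).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def profile_zip_field(values):
    """Return the share of values that look like ZIP codes, plus the ones that do not."""
    conforming, suspect = [], []
    for value in values:
        if ZIP_PATTERN.match(value.strip()):
            conforming.append(value)
        else:
            suspect.append(value)
    rate = len(conforming) / len(values) if values else 0.0
    return rate, suspect

# Hypothetical sample: one online transaction stored a Web site in the zip field.
sample = ["10001", "94105-1234", "www.example.com", "60614"]
rate, suspect = profile_zip_field(sample)
print(f"{rate:.0%} look like ZIP codes; suspect values: {suspect}")
```

A check like this only flags that the field is doing double duty. Deciding what the stray values actually mean still takes someone who knows the data.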
What about data profiling tools? Isn’t this the kind of scenario for which such solutions are tailor-made? Not necessarily, Gorelick argues. “Why don’t we use profiling tools? Well, [when] I run profiling I got a very nice report that told me I had thousands of columns with integers going from 1-10, and on the other side I got reports [that tell me basically] the same thing. It just isn’t clear what to do with it,” he points out. “For example, distributed and mainframe column [names are] different: long for the former, really short abbreviations for the latter [because of resource constraints]. You really need to understand the context to understand the relationship between the data.”
The rub, Gorelick says, is that data profiling and data quality tools are most helpful when they’re used against a single source—not as a means to cleanse and integrate multiple heterogeneous systems.
“After a couple of years of struggling with these [issues], I realized it has to be data-driven. Our technology originated trying to automate this process and build a methodology around it. But there is no one methodology. You can’t go to Amazon and buy a book on data mapping. Not only are people struggling to try to do this manually, there is no guidance for them to do it. A data-driven approach where people can automatically discover the mappings and simultaneously build a [custom] methodology around this is the best [overall] approach.” DataMapper helps companies do just that, says Gorelick.
Metadata integration and reconciliation can’t be rushed, he asserts. There is no “automatic” or “turnkey” deus ex machina that helps rapidly deliver a company from the polyglot hell of metadata Babel. For this reason, he stresses, Exeros assumes that metadata and domain experts, business analysts, and other knowledgeable stakeholders will be actively involved in the data integration process.
“What I realized is they’re really doing more or less the same thing. They look at all the data. They try to group it into business entities, and then they narrow it down. There is a methodology they use… [and] 80 percent of the methodology is repeatable,” he indicates. “People can’t print millions of rows and go through it with a highlighter; it’s just not practical. But computers can. Computers can go through [data] with semantics. By giving the analyst the tool and the methodology, we make it easier [to identify integration issues]. We don’t eliminate them. That’s what the analysts are paid to do.”
The typical DataMapper implementation involves an analyst and her workstation. Jane C. Analyst interacts with Exeros’ Mapping Studio, Gorelick explains, which (ultimately) processes and renders the discovery information that’s unearthed by DataMapper’s spider search agents, which are called DataBots.
Discovery can take anywhere from several minutes to several days, depending on the complexity and heterogeneity of a company’s infrastructure. DataBots don’t use canned connectivity to get at data sources; instead, they tap ODBC or direct file access. There’s a reason for this, Gorelick argues.
“We can’t rely on semantics. With Acta, [we were able] to prepackage semantics for SAP. Unfortunately, [packaging semantics] breaks down when you’ve got lots of legacy stuff. You can only invest in [connectivity] for the most common uses, the most common modules. We can’t rely on semantics for all of these different systems. We just connect to every database through ODBC, or to any file, and we use metadata to express this information,” he indicates.
In this respect, Gorelick says, DataMapper’s methodology is straightforward: “Given the source and the target, we will try to generate the target by the source. Given a source table of files, and given a target table of files, for each target, we’ll figure out how to generate from the source.”
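The article does not say how DataMapper actually works out which source produces which target. One way to picture a data-driven approach, offered purely as an illustrative sketch, is to compare the values themselves and ignore the column labels: for each target column, pick the source column that accounts for the largest share of its values. The function names, column names, and sample data below are all hypothetical.

```python
def match_rate(target_values, source_values):
    """Fraction of target values that also appear somewhere in the source column."""
    if not target_values:
        return 0.0
    source_set = set(source_values)
    hits = sum(1 for value in target_values if value in source_set)
    return hits / len(target_values)

def discover_mappings(source, target):
    """For each target column, find the source column whose values match best.

    source and target map a column name to a list of values.
    Returns {target_column: (best_source_column, match_rate)}.
    """
    mappings = {}
    for t_name, t_values in target.items():
        best = max(source, key=lambda s_name: match_rate(t_values, source[s_name]))
        mappings[t_name] = (best, match_rate(t_values, source[best]))
    return mappings

# Hypothetical data: terse mainframe-style names on the source side,
# long descriptive names on the target side.
source = {
    "CUST_NO": ["1001", "1002", "1003", "1004", "1005"],
    "ST": ["NY", "CA", "IL", "NY", "TX"],
}
target = {
    "customer_id": ["1001", "1002", "1003", "1004", "9999"],
    "state_code": ["NY", "CA", "IL"],
}

result = discover_mappings(source, target)
for t_name, (s_name, rate) in result.items():
    print(f"{t_name} <- {s_name} ({rate:.0%} of target values found in source)")
```

In this toy data, `customer_id` maps to `CUST_NO` with only a partial match, which is the sort of result an analyst would want to look at more closely. A real tool would also have to handle transformations, composite keys, and far larger volumes, none of which this sketch attempts.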
Enter Jane C. Analyst. “The analyst schedules discovery. It might run for hours, sometimes even for days, if you have enough. Comes back with results, shows the analysts the data, [which includes] all the possibilities that it found. It identifies potential mismatches, for example, such and such only matches 80 percent of the time. Then the analyst can look at it and say, ‘I want to do discovery on these accounts,’ [so she can] schedule deeper discovery.”
This helps accelerate the process, Gorelick says—although it’s still far from automatic, he stresses. “It’s an interactive process where the discovery, the heavy lifting, is done in the background, and then the analyst works with the results,” he explains. “For the analyst, you need to pull a lot of information together in something that, again, is usable. We’re typically at least five times faster than by doing it by hand. If we said 50 times, people would laugh, of course, but that’s closer to the truth. Imagine doing the heavy-lifting by computer [with a repeatable methodology], instead of going through all of this data with ad hoc tools—or with a highlighter.”