In the Middle of DatA Integration Is AI

The connection between data integration and artificial intelligence is growing.

Have you ever noticed the acronym hiding in plain sight in the middle of data integration? Artificial intelligence (AI) can be like that -- it creeps up on you unnoticed until suddenly it's all over the place.

How Are Data Integration and AI Related?

One of AI's early appearances in this field came in a 2013 research paper: "Data Curation at Scale: The Data Tamer System" by Michael Stonebraker and others. Although labeled as curation, the topic is largely the same as data preparation, data integration, data unification, ETL, MDM, DWA, or whatever you choose to call it. In essence, it is the set of processes needed between the multiple, inchoate data sources of a modern business and any cohesive system claiming to deliver consistent insights from them.

According to the authors, "At ... scale, data curation cannot be a manual (human) effort, but must entail machine learning approaches with a human assist only when necessary." This MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) research was productized in 2014 as Tamr.

Combining machine learning (AI) with human assistance makes sense in the context of data integration. Training an unsupervised AI system requires enormous amounts of data, even where the approach is technically appropriate. Supervised learning -- human assistance in tagging the training set -- is often more effective where training data is limited. In data integration, where data volumes are typically smaller, human audit and correction after AI training is the likely scenario.
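
To make the trade-off concrete, here is a minimal sketch in Python of the human-assisted pattern -- not Tamr's or Informatica's actual method. A classifier trained on a small set of human-labeled record-pair features decides the confident cases and escalates uncertain ones for review; the features, labels, and thresholds are hypothetical illustrations.

    # A minimal sketch, assuming hypothetical pair features and thresholds:
    # a classifier trained on a small, human-labeled set decides confident
    # cases and escalates the uncertain ones for human review.
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pair features [name_similarity, address_similarity];
    # labels supplied by human reviewers: 1 = same entity, 0 = different.
    X_train = [[0.95, 0.90], [0.20, 0.10], [0.88, 0.15], [0.10, 0.85]]
    y_train = [1, 0, 1, 0]

    model = LogisticRegression(C=100).fit(X_train, y_train)  # light regularization for a tiny set

    def classify_pair(features, low=0.25, high=0.75):
        """Auto-decide confident cases; escalate the rest for human audit."""
        p_match = model.predict_proba([features])[0][1]
        if p_match >= high:
            return "match"
        if p_match <= low:
            return "non-match"
        return "human review"  # the "human assist only when necessary" step

    for pair in ([0.91, 0.88], [0.50, 0.55]):  # a clear pair and an ambiguous one
        print(pair, "->", classify_pair(pair))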

Metadata, Integration, and AI

The Data Tamer (and Tamr product) methodology explicitly calls out three levels of context-setting information encountered in data integration and how each can be addressed (a minimal sketch in code follows the list):

  1. Complete knowledge -- a predefined, human-provided model or schema is used top-down to characterize the content
  2. No knowledge available -- context-setting information is inferred by AI entirely bottom-up from the content
  3. Partial information available -- a combination of top-down and bottom-up is used
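
As a rough illustration of these three levels -- and only an illustration; Data Tamer's internals are far more sophisticated -- consider a function that accepts an optional human-provided schema and falls back to inference from the content. All function and field names here are hypothetical.

    # A rough illustration of the three context levels; all names hypothetical.
    def resolve_schema(records, provided_schema=None):
        """Characterize content top-down, bottom-up, or with a mix of both."""
        inferred = {}
        for rec in records:  # bottom-up: inspect the content itself
            for fld, value in rec.items():
                inferred.setdefault(fld, type(value).__name__)

        if provided_schema is None:  # level 2: no knowledge available
            return inferred
        if set(provided_schema) >= set(inferred):  # level 1: complete knowledge
            return dict(provided_schema)
        return {**inferred, **provided_schema}  # level 3: partial information

    records = [{"cust_id": 17, "name": "Acme"}, {"cust_id": 18, "region": "EU"}]
    print(resolve_schema(records))  # inferred entirely from content
    print(resolve_schema(records, {"cust_id": "int"}))  # human entry wins where given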

Machine learning aside, this approach dates back to the earliest days of data warehousing.

Dr. Stonebraker's most recent white paper extends his thinking to "The Seven Tenets of Scalable Data Unification." Here, he claims that traditional data integration approaches miss most or all of these tenets, while Tamr (unsurprisingly) meets them all. Longtime practitioners of data warehouse population are unlikely to agree with the first claim: each tenet has been considered and implemented in one form or another by data integration products as technology allowed.

The Scale of Automation

As AI plays an increasing role in data integration, the important questions become how completely the process can be automated and what degree of human assistance is still required.

In a recent briefing, Informatica AI expert and product manager Einat Haftel asserted that the company's recently announced AI-based functionality, the expressively named CLAIRE -- Cloud-scale AI-powered Real-time Engine -- can already operate without human assistance in some areas and will soon do so in others. Her ability (yes, let's anthropomorphize, as the name seems to intend!) arises directly from the amount of metadata Informatica has gathered over its many years in the data integration business.

Data integration, governance, and subsequent data discovery by the business are dependent on an information catalog, which defines and describes common (and less common) data names, meanings, and usage for the enterprise. This clearly starts with human definition, sometimes through data governance initiatives or simply by encouraging business experts to document their data.
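
A minimal sketch of what a human-defined catalog entry might hold follows; the structure and field names are illustrative assumptions, not Informatica's actual catalog model.

    # A minimal sketch of a human-defined catalog entry; the field names are
    # illustrative assumptions, not Informatica's actual catalog model.
    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        name: str                    # common business name for the data element
        definition: str              # agreed meaning
        usage: str                   # where and how the element is used
        synonyms: list = field(default_factory=list)
        steward: str = "unassigned"  # human owner, e.g. from a governance initiative

    catalog = {
        "customer_id": CatalogEntry(
            name="Customer Identifier",
            definition="Unique key assigned to each customer at onboarding",
            usage="Joins sales, support, and billing records",
            synonyms=["cust_id", "client_no"],
            steward="Sales Operations",
        )
    }
    print(catalog["customer_id"].definition)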

However, with half a century of cataloging and modeling business data across all industries, a sufficient corpus of training data possibly exists to enable AI to automate the creation of business-specific information catalogs. Within the coming year, CLAIRE will be able to identify and catalog information using natural language processing and other AI techniques without the need for human intervention, although such manual control is, of course, allowed.

Informatica already offers Intelligent Structure Discovery, a facility for deciphering the types of nonstandard, comma-delimited, and often header-free data structures sourced from the Internet of Things and other machine-generated data streams. First, a learning algorithm runs independently of human input to understand and classify an unknown input structure, detecting its component parts and using the result to extract and normalize data from it.

The result is an "intelligent parser" that uses this model to perform fully automated run-time transformations on files of similar structure. If desired, business users can view the model in a table or tree format and refine it: renaming elements, normalizing the structure, and so on.
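
The sketch below captures the general idea -- infer a structure model from sample lines, then reuse that model at run time on files of similar shape. It is only a toy stand-in under assumed simplifications (delimiter guessing and numeric-type detection), not Informatica's actual implementation.

    # A toy stand-in for the "learn a model, then parse" idea -- not
    # Informatica's Intelligent Structure Discovery itself. It guesses a
    # delimiter and a type per column from sample lines, then reuses that
    # model to normalize files of similar structure.
    import csv

    def _is_num(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    def infer_model(sample_lines):
        """Pick the most plausible delimiter and a type for each column."""
        delimiter = max(",;|\t", key=lambda d: sum(line.count(d) for line in sample_lines))
        rows = list(csv.reader(sample_lines, delimiter=delimiter))
        types = [float if all(_is_num(v) for v in col) else str
                 for col in zip(*rows)]
        return {"delimiter": delimiter, "types": types}

    def parse(lines, model):
        """Run-time transformation: apply the learned model to similar files."""
        for row in csv.reader(lines, delimiter=model["delimiter"]):
            yield [t(v) for t, v in zip(model["types"], row)]

    sample = ["sensor-1|21.5|OK", "sensor-2|19.8|OK"]  # header-free machine data
    model = infer_model(sample)
    print(list(parse(["sensor-3|22.1|WARN"], model)))  # [['sensor-3', 22.1, 'WARN']]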

The Future of Human Input

With CLAIRE, Informatica claims that AI can fully automate a significant portion of the data management, governance, and integration work of the enterprise. This largely eliminates the need for human intervention -- both business and IT -- in these areas, delivering substantial productivity gains. Human involvement is at the discretion of the business.

There has been a significant shift in emphasis in only four years from Tamr's "use AI with human assistance" to Informatica's "AI needs ever decreasing human input" in data integration and management.

The earlier position promises productivity improvements. The latter, in addition, poses questions of trust: How far are we willing to go in handing control of data governance to artificial intelligence? Will AI-based data integration consistently perform better than a combination of business and IT experts? Is data management via AI compatible with the idea that information should be a source of innovation in business?

About the Author

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing in 1988. With over 40 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer, and author of "Data Warehouse -- from Architecture to Implementation" and "Business unIntelligence -- Insight and Innovation beyond Analytics and Big Data" as well as numerous white papers. As founder and principal of 9sight Consulting, Devlin develops new architectural models and provides international, strategic thought leadership from Cornwall. His latest book, "Cloud Data Warehousing, Volume I: Architecting Data Warehouse, Lakehouse, Mesh, and Fabric," is now available.

