TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Taming the Wild Profusion of Data Sources

New solution Tamr learns over time; it's an intelligent midwife for data source and information discovery.

By Steve Swoyer
April 28, 2016

Are you a large enterprise? Do you have a profusion of different data sources? Can you actually account for all of the data sources you do have? You can? All of the data sources? (Don't forget spreadmarts, rogue data marts, isolated instances of Tableau, and obscure legacy systems.) One more thing: Do you think there might be a few data sources you don't know about? You aren't sure? Then you're a prospective customer for Tamr.

If you think about it, Tamr's pitch is a spin on Meno's Paradox. You have your known data sources, you have your known-unknown data sources, and you have your unknown-unknown data sources.

You live and work with (and around) your unknown-unknown data sources every day. You've been living and working with (and around) them. How can you discover them? How can you find out about them and gain insight into them?

More to the point, you have questions: about products, about suppliers, about customers. Your data has answers, but you don't know where, in which silo, those answers lie. You don't even know about all of the silos in which they could lie.

It's not that different from the problem Meno posed to Socrates.

There's a delicious Socratic irony here, too. With the known data you have now, you have the capacity to ask certain kinds of questions, but you don't know how to ask others. You can't bring all of your data together in a single context -- or otherwise make it available in a broad view -- so you aren't even aware that there are other questions.

To discover new knowledge by identifying and unifying data sources throughout the enterprise is to discover new kinds of questions.

"[W]e have answered a variety of questions for our customers. For example, in procurement, we've answered, 'What are the top suppliers for a particular part? How do I get a part at the best price?' In customer data, we've answered questions around, 'Which are the right customers to cross-line and upsell to?'" Shobhit Chugh, product lead and entrepreneur with Tamr, told a gathering of the Boulder BI Brain Trust (BBBT) in a January presentation.

"The problem is ... that data is siloed across different parts of the enterprise. What we help companies do is unify this data from all these different parts of the enterprise."

Fundamentally, as Chugh says, this is an issue of silos. Siloed data is an age-old problem, however. In its pre-digital forms, it's as old as Socrates, or older. Why do Chugh and Tamr claim to have it licked? It's at this point that company officials tend to invoke the name of database design and information management genius Michael Stonebraker.

Tamr, you see, is yet another Stonebraker project. (There should be an acronym for this: YASP. From Ingres to Postgres to StreamBase to Vertica to VoltDB to SciDB, Stonebraker is the Johnny Startup-seed of his generation.) The magic of Tamr's pedigree notwithstanding, its software relies less on pixie dust than on smart technology.

Tamr, the platform, was developed in part at MIT, where Stonebraker teaches. It's a combination of data source discovery and rich metadata management technologies, machine learning algorithms, rule-driven automation technology, and human expertise. Put it all together, Chugh claims, and you have a formula for taming the wild profusion of existing data sources.

Socrates used to describe himself as a midwife of knowledge. As he saw it, his job was to help "birth" or "deliver" knowledge by asking questions. Tamr does something similar: it generates "Yes" or "No" questions, puts them to human specialists, captures and codifies their expertise, and feeds this back into the system.

In other words, Tamr learns over time; it's an intelligent midwife for data source and information discovery. It gets smarter -- more efficient -- with use.
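Tamr has not published how this loop is implemented, so the following is only a minimal Python sketch of the general idea, under loud assumptions. Confident matches and confident non-matches are decided automatically, and only the uncertain middle band is turned into a "Yes" or "No" question for a person. The class name `MatchReviewer`, the `ask` callback, and both thresholds are hypothetical, and the sketch merely remembers expert answers; a real system would presumably use them to retrain a model.

```python
from difflib import SequenceMatcher


class MatchReviewer:
    """Hypothetical human-in-the-loop matcher (not Tamr's actual design)."""

    def __init__(self, ask, low=0.5, high=0.9):
        self.ask = ask            # callable(question: str) -> bool, stands in for the expert
        self.low = low            # below this score: treat as different entities
        self.high = high          # at or above this score: treat as the same entity
        self.answers = {}         # remembered expert verdicts, keyed by the unordered pair

    def same_entity(self, a, b):
        key = frozenset((a, b))
        if key in self.answers:   # reuse an earlier expert answer instead of asking again
            return self.answers[key]
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= self.high:
            return True
        if score < self.low:
            return False
        # Uncertain band: generate a yes/no question and store the verdict.
        verdict = self.ask(f"Do '{a}' and '{b}' refer to the same entity? (yes/no)")
        self.answers[key] = verdict
        return verdict
```

In this sketch the expert is asked about a given pair at most once, which is the simplest sense in which such a system could get "more efficient with use."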

How much more efficient? Purveyors of business intelligence and analytics technologies like to quantify their success. They often claim to eliminate 80 percent of the work -- or time, or money, or all three -- involved in performing a task or completing a project. Tamr goes further: it claims to eliminate 90 percent of the time and effort involved in identifying, cataloging, and integrating -- "unifying," in Tamr-speak -- enterprise data sources.

"[We] use a combination of machine learning and then input from people. We are able to create a very small training set, ask questions of people, and from that... learn and then automate a lot of these processes, resulting in a huge productivity improvement," Chugh told the BBBT.

By indexing data sources and building its metadata catalog -- the aptly named "Catalog" -- Tamr's data discovery and access technology (Tamr Connect) can reference and relate transactions and records across multiple data sources. Chugh claims it can also effectively cleanse and standardize critical attributes, including customer, supplier, or product names, IDs, SKUs, and addresses.

"Catalog is a way to organize metadata about all the data sources that exist within the enterprise. It came from a question that we had from customers, that, 'Hey, you know, I understand unifying these data sources, but really, I don't even know where all these data sources are, and I don't have enough knowledge about it,'" Chugh explained. "Catalog is a collaborative tool that helps people add in information about these sources, discuss, and build this knowledge about data sources over time."

Tamr Connect uses "fuzzy" matching to reconcile and associate inconsistently named products, parts, suppliers, customers, etc. Because of its human feedback loop, in which it learns from the responses of human experts and generates rules based on them, its "fuzzy" accuracy improves over time.
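Tamr's matching algorithms are not described here, so as a rough illustration of what "fuzzy" matching means in practice, here is a small sketch built on `difflib` from the Python standard library. It is an assumption-laden stand-in, not Tamr's method: the function names, the 0.8 threshold, and the sample company names are all made up for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing reaches the threshold."""
    scored = [(similarity(name, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return candidate if score >= threshold else None


# Hypothetical supplier names, purely for illustration.
known_suppliers = ["Apex Industries", "ACME Corp"]
print(best_match("Acme Corp.", known_suppliers))
print(best_match("Zebra Ltd", known_suppliers))
```

A fixed threshold like this is exactly the kind of rule that expert feedback could tune over time; pairs that score just under it are natural candidates for a human to review.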

Tamr Consume, the final piece of the platform puzzle, is both an information repository and a REST gateway. REST -- representational state transfer -- is the architectural style behind most loosely coupled web and cloud services. Consume permits analysts and other knowledgeable users to expose data sets as RESTful services, which means they can be shared with other REST consumers.
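To make "exposing a data set as a RESTful service" concrete, the sketch below serves named data sets as JSON over HTTP using only the Python standard library. It says nothing about how Tamr Consume actually works: the URL layout (`/<data set name>`), the helper names `make_handler` and `make_server`, and the sample data are all assumptions chosen for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(datasets):
    """Build a request handler that serves the given dict of data sets."""

    class DatasetHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            name = self.path.strip("/")          # "/suppliers" -> "suppliers"
            if name in datasets:
                status, payload = 200, datasets[name]
            else:
                status, payload = 404, {"error": f"unknown data set: {name}"}
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging to keep the sketch quiet

    return DatasetHandler


def make_server(datasets, port=0):
    """Create (but do not start) a local server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), make_handler(datasets))
```

Calling `make_server({"suppliers": [...]}).serve_forever()` would then let any HTTP client fetch `/suppliers` as JSON, which is the sense in which a data set becomes shareable with other REST consumers.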

Above all, Chugh claimed, Tamr is a technology for identifying and -- to the extent practicable or desirable -- capturing certain kinds of human expertise. Once you capture and codify this expertise, it is possible for you to automate certain kinds of decision making.
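Identifying expertise implies keeping some record of who tends to be right about what. The sketch below shows one simple way that could be done: it tracks, per expert and per topic, how often an answer was later confirmed, and sends the next question to the expert with the best smoothed track record. Everything in it is hypothetical, including the `ExpertRouter` name, the notion of a "confirmed" answer, and the smoothing formula; the source does not say how Tamr scores or selects experts.

```python
from collections import defaultdict


class ExpertRouter:
    """Hypothetical per-topic track record for routing questions to experts."""

    def __init__(self):
        # (expert, topic) -> [confirmed_answers, total_answers]
        self._record = defaultdict(lambda: [0, 0])

    def record(self, expert, topic, was_confirmed):
        """Log one answer and whether it was later confirmed (an assumed signal)."""
        stats = self._record[(expert, topic)]
        stats[0] += int(was_confirmed)
        stats[1] += 1

    def score(self, expert, topic):
        """Laplace-smoothed confirmation rate; experts with no history score 0.5."""
        confirmed, total = self._record.get((expert, topic), (0, 0))
        return (confirmed + 1) / (total + 2)

    def route(self, topic, experts):
        """Pick the expert with the highest score for this topic."""
        return max(experts, key=lambda e: self.score(e, topic))


router = ExpertRouter()
for _ in range(3):
    router.record("ana", "suppliers", True)
router.record("ben", "suppliers", True)
router.record("ben", "suppliers", False)
print(router.route("suppliers", ["ana", "ben"]))
```

The smoothing keeps a newcomer with one lucky answer from outranking someone with a long, mostly correct history.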

"Another big part of this is this information does not exist in a single person's head. It's spread across the enterprise. There are various people who are experts in different parts of the data," he concluded. "A big part of Tamr is collecting this feedback from these different experts, learning which experts know more about certain kinds of data, and then directing appropriate questions to them based on that knowledge."

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at [email protected].
