TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

00 Days

00 Hrs

00 Min

00 Sec

Merging Multisource Customer Data

Finding common attributes to join customer data across multiple data sources is extremely important, but it can be time consuming and challenging. Thankfully, there are methods and tools that can help in this process.

By Troy Hiltbrand
June 6, 2016

When dealing with customer data from multiple sources, analytics practitioners know that one of the most challenging activities is unifying the data so that there is a single profile for each customer. The representation of the customer is often different in each source, so to merge these sources you must uncover a common attribute or set of attributes.

In some cases, you may have a unique identifier assigned to the customer and utilized across multiple systems. This makes the process of pulling data together from multiple systems relatively easy. The problem is that this magical attribute is often not available. With systems developed independently, at times by multiple providers, a single unifying attribute across these systems to represent a unique customer is often missing.

At this point, you have to find an alternative attribute or set of attributes on which to merge the data. Some attributes are more likely to identify only a single customer, and are thus easier to use, than others.

A few attributes such as tax identification numbers can clearly identify one customer across multiple interactions. The purpose of these numbers is for governments to consolidate information across multiple financial transactions. Not all customer transactions need to be under government oversight, however, and so this type of ID is often not available.

In addition, capturing and maintaining this type of data brings with it a higher level of liability. The company has a responsibility to protect this data for the customer.

Other financial attributes, such as credit card or bank account numbers, also provide a high probability of uniqueness and carry a level of liability for the company. By storing these, a company accepts the responsibility from both the consumer and the financial provider to protect the information.

To protect consumers' financial information, most systems don't store the credit card or bank account number in plain text. Some hash the number so that the system has a fingerprint of it but do not take on the liability of managing the card number in a way that it can be retrieved. Others only store a small portion of the number, such as the last four digits, for later transaction identification. Others will use industry standard encryption methods to protect the data, but have the keys to decrypt the information as needed.

As long as the hashed or encrypted values are transformed using a common algorithm and a common encryption key across systems, the modified values can be used to merge customer data as well as the plain text version.

Although credit card numbers provide a high likelihood of uniqueness across systems for a customer, many people hold multiple credit cards and use them under different circumstances. This can make it more difficult to unify a customer profile based solely on credit card or bank account number.

Personal attributes, such as full name and full address, are more plentiful and have fewer restrictions on their availability but also have more limitations on their ability to simply join data sets in their raw format.

Used independently, joining customer profiles is difficult on an attribute such as name or address. Used jointly, there is a much higher probability of uniquely identifying a customer across multiple systems. In other words, matching only on last name is likely to generate many false positive matches. Matching on first name, last name, city, state, and postal code reduces the chance of a false positive match.

When working with attributes such as name and address, spelling, capitalization, spacing, and abbreviation all have an impact on the comparison. If the attributes are not exactly the same in both systems, a simple equivalence test will fail.

To address this challenge, a number of methods allow you to compare strings of characters that are close, but not exact. Soundex, metaphone, NYSIIS, and Daitch-Mokotoff Soundex all convert words into a coded representation of their phonetics to allow direct comparison.

Metrics such as the Levenshtein distance measure the relative closeness of two strings. Levenshtein distance determines how many changes have to be made to convert one word into another, including insertion, deletion, and replacement of characters.

These phonetic representations and distance calculations allow for comparison of similar strings that are not exact and you can use these types of tools to find attributes across data sources that are similar enough to unify a customer profile.

Finding a common attribute or common set of attributes is not always easy, but it is critical for analytics practitioners to be able to unify multiple sources of data. Successfully combining sources is absolutely necessary to get the most value from your analytics.

About the Author

Troy Hiltbrand is the senior vice president of digital product management and analytics at Partner.co where he is responsible for its enterprise analytics and digital product strategy. You can reach the author via email.

TDWI Membership

Accelerate Your Projects,
and Your Career

TDWI Members have access to exclusive research reports, publications, communities and training.

Individual, Student, and Team memberships available.

TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Research & Resources

Webinars

Virtual Summits

TDWI Articles

Merging Multisource Customer Data

Related Articles

Trending Articles

From Reactive to Proactive: Automating Data Quality in Petabyte-Scale Analytics Pipelines

From Pilot to Production: Why LLM Features Stall, and a Readiness Checklist for Data Leaders

The Inferencing Cost Problem No One Is Talking About: Unstructured Data Quality

The Hidden Cost of Poor Training Data in Generative AI

TDWI Membership

Accelerate Your Projects,
and Your Career

TDWI

Engage

Research

Research & Resources

Webinars

Virtual Summits

TDWI Articles

Merging Multisource Customer Data

Related Articles

Trending Articles

From Reactive to Proactive: Automating Data Quality in Petabyte-Scale Analytics Pipelines

From Pilot to Production: Why LLM Features Stall, and a Readiness Checklist for Data Leaders

The Inferencing Cost Problem No One Is Talking About: Unstructured Data Quality

The Hidden Cost of Poor Training Data in Generative AI

TDWI Membership

Accelerate Your Projects, and Your Career

TDWI

Engage

Research

Accelerate Your Projects,
and Your Career