TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

Entity Resolution: How Organizations Figure Out That Two Records Are the Same Person

A large company almost never has one record per customer. It has many. The same person signed up on the website as "Robert Smith," called support and got entered as "Bob Smith," made a purchase under "R. Smith" with a typo in the street address, and appears a fourth time in a list acquired when the company bought a competitor. Four records. One human being. And no field anywhere that says so.

Entity resolution is the practice of figuring out that those four records are the same person, and doing it reliably across millions of records where the clues are partial, inconsistent, and frequently wrong. It's a deceptively hard problem, and solving it is what stands between an organization and the elusive goal of knowing who its customers actually are.

The reason it's hard is that the obvious solution doesn't work. You might think you could just match records that are identical, and be done. But records describing the same person are almost never identical. Names get shortened, misspelled, or entered with a middle initial in one place and not another. Addresses change when people move, and get abbreviated differently every time. Phone numbers get formatted a dozen ways. Information goes stale. Fields get left blank. The real world is messy, and that mess lands directly in the data.

So entity resolution can't rely on exact matches. It has to make judgments about similarity, and that turns out to require some genuine cleverness.

The simplest approach is called deterministic matching, and it works on firm rules. If two records share the same email address, call them the same person. If they share the same government ID number, same person. Deterministic matching is fast, easy to understand, and highly trustworthy when you have a reliable shared identifier to match on. When two records carry the same Social Security number, there's not much room for doubt.
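A rule like "same email, same person" fits in a few lines. The sketch below is illustrative only: the record layout, the `deterministic_matches` function, and the choice to lowercase and trim the email before comparing are all assumptions, not a description of any particular product.

```python
from collections import defaultdict

def deterministic_matches(records, key="email"):
    """Group record ids that share the same non-empty value for `key`."""
    groups = defaultdict(list)
    for rec in records:
        # Assumed normalization: trim whitespace and ignore letter case.
        value = (rec.get(key) or "").strip().lower()
        if value:  # a blank identifier never matches anything
            groups[value].append(rec["id"])
    # Only values shared by two or more records count as matches.
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical sample records.
records = [
    {"id": 1, "name": "Robert Smith", "email": "rsmith@example.com"},
    {"id": 2, "name": "Bob Smith", "email": "RSmith@example.com "},
    {"id": 3, "name": "R. Smith", "email": None},
    {"id": 4, "name": "Alice Jones", "email": "alice@example.com"},
]

matches = deterministic_matches(records)
print(matches)
```

Records 1 and 2 are grouped because their emails agree once normalized. Record 3 has no email at all, so the rule has nothing to say about it.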

The problem is that reliable identifiers are often missing or themselves inconsistent. Not every record has an email. People share email addresses. Someone fat-fingers a digit in an ID number. And many of the records you most want to match, the ones from different systems built at different times, were never designed to share an identifier in the first place. Deterministic matching catches the easy cases and leaves the hard ones untouched.

For those harder cases, there's probabilistic matching, and this is where entity resolution gets interesting.

Instead of demanding an exact match on a single field, probabilistic matching weighs evidence across many fields and computes how likely it is that two records refer to the same entity. It treats matching as a question of accumulated probability rather than a yes-or-no rule. Two records with similar names, the same date of birth, and addresses one digit apart are probably the same person, even though no single field matches perfectly. Two records that share only a common last name probably aren't. The system assigns weight to each kind of agreement, adds it up, and produces a score.
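The idea of weighing agreement across fields can be sketched roughly as follows. The weights here are made up for illustration, and the score is a simple weighted average of per-field similarity; production systems typically estimate their weights from data rather than hard-coding them. The string comparison uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Similarity between 0.0 and 1.0; missing values contribute nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact(a, b):
    """1.0 when both values are present and identical, else 0.0."""
    return 1.0 if a and b and a == b else 0.0

# Hypothetical weights and comparators, chosen only for illustration.
FIELDS = {
    "name": (2.0, fuzzy),
    "dob": (4.0, exact),
    "address": (3.0, fuzzy),
}

def match_score(rec_a, rec_b):
    """Weighted average of field similarities, between 0.0 and 1.0."""
    total_weight = sum(weight for weight, _ in FIELDS.values())
    earned = sum(
        weight * compare(rec_a.get(field), rec_b.get(field))
        for field, (weight, compare) in FIELDS.items()
    )
    return earned / total_weight

a = {"name": "Robert Smith", "dob": "1985-03-14", "address": "42 Oak Street"}
b = {"name": "Bob Smith", "dob": "1985-03-14", "address": "42 Oak St"}
c = {"name": "Jane Smith", "dob": "1990-07-02", "address": "17 Elm Avenue"}

print(round(match_score(a, b), 2), round(match_score(a, c), 2))
```

The pair that agrees on date of birth and nearly agrees on name and address scores well above the pair that shares only a last name, even though neither pair matches exactly on every field.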

What makes this powerful is that it mirrors how a human would reason. If you saw "Robert Smith, born March 1985, living at 42 Oak Street" next to "Bob Smith, born March 1985, living at 42 Oak St," you wouldn't hesitate to conclude they're the same person, even though the name and address don't match character for character. You're weighing the evidence. Probabilistic matching is a way of teaching a computer to weigh it the same way, but across millions of comparisons no human could ever do by hand.

The scoring leads to three outcomes rather than two. Some pairs score high enough to confidently call a match. Some score low enough to confidently call a non-match. And some land in an uncertain middle, where the evidence is suggestive but not conclusive. Many systems route those middle cases to a human reviewer, who makes the final call. The machine handles the volume, the person handles the genuinely ambiguous, and over time the borderline decisions can be fed back to improve the matching itself.
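In code, the three outcomes usually come down to two thresholds. The values below are placeholders chosen for the example; in practice they would be tuned against reviewed samples.

```python
def classify(score, lower=0.40, upper=0.85):
    """Map a match score to one of three outcomes (thresholds are hypothetical)."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"

# Hypothetical scored pairs: ((record_id_a, record_id_b), score).
scored_pairs = [((1, 2), 0.93), ((1, 3), 0.62), ((1, 4), 0.12)]

# Only the uncertain middle goes to a human reviewer.
review_queue = [pair for pair, score in scored_pairs if classify(score) == "review"]
print(review_queue)
```

Moving the thresholds closer together sends fewer pairs to reviewers but raises the chance of a confident wrong answer. Moving them apart does the opposite.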

There's a practical obstacle that has to be solved before any of this can run at scale, and it's a matter of sheer arithmetic. Comparing every record against every other record gets impractical fast, because the number of pairs grows with the square of the number of records. A million records would mean roughly half a trillion comparisons. Few systems can afford that, especially when each comparison involves fuzzy matching across several fields. The standard fix is a technique called blocking, which first sorts records into rough groups that could plausibly match, by shared zip code, say, or the first few letters of a last name, and then runs the expensive detailed comparisons only within each group. Blocking trades a small risk of missing an unusual match for an enormous gain in speed, and without it, entity resolution at real-world scale simply wouldn't be feasible.
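Both the arithmetic and the fix are easy to see in a small sketch. The blocking key used here, zip code combined with the first three letters of the last name, is just one hypothetical choice, and the field names are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Number of distinct pairs among one million records: n * (n - 1) / 2.
all_pairs_for_a_million = comb(1_000_000, 2)  # 499,999,500,000

def blocking_key(rec):
    """Hypothetical key: zip code plus the first three letters of the last name."""
    return (rec["zip"], rec["last_name"][:3].lower())

def candidate_pairs(records):
    """Yield only the pairs of record ids that fall in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)

# Hypothetical sample records.
records = [
    {"id": 1, "last_name": "Smith", "zip": "60614"},
    {"id": 2, "last_name": "Smith", "zip": "60614"},
    {"id": 3, "last_name": "Smyth", "zip": "60614"},
    {"id": 4, "last_name": "Jones", "zip": "60614"},
    {"id": 5, "last_name": "Smith", "zip": "10001"},
]

pairs = list(candidate_pairs(records))
print(pairs, "out of", comb(len(records), 2), "possible pairs")
```

Five records have ten possible pairs, and this key leaves one to compare. It also shows the cost: record 3, spelled "Smyth," lands in a different block and is never compared with records 1 and 2, which is exactly the small risk of a missed match that blocking accepts.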

Once the matches are found, the organization faces one more decision: what to do with them. Recognizing that four records describe one person is the discovery. Acting on it is the resolution. Some organizations merge the records into a single consolidated master record. Others leave the original records in place but link them together with a shared identifier, preserving the source data while recording the relationship. Both approaches are common, and the right one depends on how much the original records need to be preserved for audit or operational reasons.
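The two options look quite different in practice. In the sketch below, `link` leaves every source record alone and only records which entity each one belongs to, while `merge` builds a single master record. The rule used for merging, keeping the first non-empty value seen for each field, is an assumption made for the example; real survivorship rules are usually more involved. The entity id format is also hypothetical.

```python
def link(groups):
    """Return {record_id: entity_id} without changing the source records."""
    return {
        rid: f"E{n}"
        for n, ids in enumerate(groups, start=1)
        for rid in ids
    }

def merge(records):
    """Build one master record, keeping the first non-empty value per field."""
    master = {}
    for rec in records:
        for field, value in rec.items():
            if field != "id" and value and field not in master:
                master[field] = value
    return master

# Hypothetical source records, keyed by record id.
records = {
    1: {"id": 1, "name": "Robert Smith", "email": "", "phone": "555-0100"},
    2: {"id": 2, "name": "Bob Smith", "email": "rsmith@example.com", "phone": ""},
    3: {"id": 3, "name": "Alice Jones", "email": "alice@example.com", "phone": ""},
}

# Assume matching has already decided that records 1 and 2 are the same person.
groups = [[1, 2], [3]]

links = link(groups)
masters = [merge([records[rid] for rid in ids]) for ids in groups]
print(links)
print(masters[0])
```

Linking keeps "Bob Smith" on file exactly as it was entered, which matters for audit. Merging discards it in favor of one consolidated view.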

This is the machinery underneath master data management and the single view of the customer that so many organizations chase. The goal of having one trustworthy answer to "how many customers do we have" and "what do we know about this person" depends entirely on being able to recognize when scattered records describe the same entity. Entity resolution is how that recognition happens. It's quiet, technical work that rarely gets attention, but the clean, unified customer view that executives ask for is impossible without it.
