TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

What Is Data Profiling? Knowing What's in Your Data Before You Trust It

Data almost never looks the way its documentation says it does.

A column meant to hold phone numbers contains a few stray email addresses. A date field has entries from 1900 and one from 2099. Ten percent of the customer records are blank in a field everyone assumed was always populated. The documentation didn't mention any of this, because documentation describes how data is supposed to look, not how it actually does. That gap is where projects quietly go wrong.

Data profiling is the practice of examining a dataset to understand its structure, contents, and quality before you rely on it for anything. It's a diagnostic step, run at the beginning rather than the end, and its purpose is to replace assumptions about the data with facts about it. You look first, measure what's there, and then decide what the data is actually good for.

The temptation is to skip it. Profiling can feel like overhead, a delay between getting the data and doing something useful with it. But the cost of skipping it doesn't disappear; it just moves. Unexamined data tends to fail at the most expensive point, which is the end, after the analysis is built, when a number looks wrong and the investigation traces all the way back to a quality problem that was sitting in the raw data from the start.

Profiling generally works at three levels, moving from the simple to the subtle.

The first level looks at individual columns, one at a time. For each one, profiling asks the basic questions: what kind of data is in here, how many values are missing, what's the range, how many distinct values appear, and how often each one shows up. The answers are often where the first surprises turn up. A field meant to hold a percentage between zero and one hundred contains a value of 4,000. A column that should never be empty is blank in a quarter of the rows. None of it is complicated to find, and all of it changes what you do next.
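As a rough illustration of those column-level questions, here is a minimal Python sketch using only the standard library. The function name `profile_column`, the sample column `discount_pct`, and the choice to count empty strings as missing are all assumptions made for the example, not part of any particular profiling tool.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: missing count, types seen, numeric range, distinct values."""
    # Assumption: None and empty or whitespace-only strings both count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "distinct": len(counts),
        "top": counts.most_common(3),  # most frequent values with their counts
    }

# Hypothetical column that is supposed to hold percentages from 0 to 100.
discount_pct = [10, 25, None, 4000, 10, "", 50, 10]
report = profile_column(discount_pct)
print(report)
```

On this sample the report shows two missing values out of eight and a maximum of 4000, which is exactly the kind of out-of-range value the documentation would never mention.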

The second level looks at relationships between columns, because fields can each look fine on their own and still contradict one another. A ship date that falls before its order date is impossible, but you'll only catch it by comparing the two. A country field that disagrees with its own postal code is the same kind of error. These inconsistencies are invisible when you examine fields in isolation, and they tend to cause outsized confusion downstream precisely because each individual piece looks correct.
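The ship-date case can be sketched in a few lines of Python. The row layout and the field names `order_date` and `ship_date` are hypothetical; the point is only that the check compares two fields within the same row rather than inspecting either one alone.

```python
from datetime import date

# Hypothetical order rows; every date is plausible when viewed in isolation.
order_rows = [
    {"order_id": 1, "order_date": date(2024, 3, 1), "ship_date": date(2024, 3, 4)},
    {"order_id": 2, "order_date": date(2024, 3, 10), "ship_date": date(2024, 3, 8)},
    {"order_id": 3, "order_date": date(2024, 3, 12), "ship_date": None},
]

def shipped_before_ordered(rows):
    """Return the ids of rows whose ship date precedes their order date."""
    return [
        r["order_id"]
        for r in rows
        # Rows with a missing date are skipped here; they belong to the
        # missing-value check, not to this consistency check.
        if r["order_date"] is not None
        and r["ship_date"] is not None
        and r["ship_date"] < r["order_date"]
    ]

violations = shipped_before_ordered(order_rows)
print(violations)  # [2]
```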

The third level reaches across tables, checking whether the connections between datasets hold up. If every order is supposed to link to a customer, profiling verifies that no order points to a customer who doesn't exist. Here profiling starts to overlap with the structural integrity of the data model itself, confirming that the relationships the system relies on are real rather than assumed.
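A cross-table check of this kind might look like the following sketch. The two tables and the key name `customer_id` are invented for illustration; in a database the same question is usually asked with an anti-join, but the logic is the same.

```python
# Hypothetical tables, each represented as a list of rows.
customer_table = [{"customer_id": 101}, {"customer_id": 102}]
order_table = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 999},   # points to a customer that does not exist
    {"order_id": 3, "customer_id": None},  # has no customer reference at all
]

def orphaned_orders(orders, customers):
    """Return ids of orders whose customer_id matches no row in customers."""
    known_ids = {c["customer_id"] for c in customers}
    # A missing customer_id is reported too, since it links to no customer.
    return [o["order_id"] for o in orders if o["customer_id"] not in known_ids]

orphans = orphaned_orders(order_table, customer_table)
print(orphans)  # [2, 3]
```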

Across all three levels, the same handful of problems keeps recurring.

Missing values are the most common, and the pattern of what's missing often tells a story of its own; a field that's empty only for older records usually points to some system change nobody wrote down. Inconsistent formats are close behind: dates written five ways, phone numbers with and without dashes, a country recorded as "USA," "U.S.A.," and "United States" in the same column. Then there are outliers, values far enough outside the normal range to be either errors or something that genuinely needs explaining, and duplicates, the same entity appearing more than once and quietly distorting every count and sum built on top of it.
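Three of these problems lend themselves to simple checks, sketched below in standard-library Python. Everything here is an assumption chosen for illustration: the sample values, the helper names `shape` and `normalize`, and the use of the common 1.5 × IQR rule for outliers, which is one convention among several rather than a method this article prescribes.

```python
import re
import statistics
from collections import Counter

def shape(value):
    """Reduce a string to its pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

def normalize(text):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(text.split()).casefold()

# Inconsistent formats: more than one shape in a column means mixed conventions.
phones = ["555-010-1234", "5550101234", "555-010-9876", "(555) 010-4321"]
phone_shapes = Counter(shape(p) for p in phones)

# Duplicates: the same entity written with different casing or spacing.
names = ["Acme Corp", "acme corp", "ACME  Corp ", "Globex"]
name_counts = Counter(normalize(n) for n in names)
duplicates = {name: n for name, n in name_counts.items() if n > 1}

# Outliers: flag values beyond 1.5 * IQR from the quartiles (assumed rule).
amounts = [10, 12, 15, 11, 14, 13, 4000]
q1, _, q3 = statistics.quantiles(amounts, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [v for v in amounts if v < low or v > high]

print(phone_shapes)
print(duplicates)
print(outliers)
```

A flagged value is only a candidate. Whether 4000 is a typo or a genuinely large order is a question the check raises but cannot answer.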

Finding these things is not the goal in itself. The goal is to make informed decisions about what to do with them.

If a field is missing values everywhere, maybe it can't be used, or maybe the gaps have to be filled in some defensible way. If formats are inconsistent, the data needs standardizing before anything else can proceed. If there are impossible values, someone has to decide whether to correct them, discard them, or trace them back and fix the problem at the source. Profiling doesn't clean the data, but it tells you precisely what cleaning will involve, which lets you scope the work honestly instead of discovering its true size halfway through.

This is why profiling sits so early in data quality work and in any serious data science workflow. You can't improve quality you haven't measured, and you can't prepare data for analysis until you know what state it's in. Profiling is the measurement that makes everything after it possible, which is why it comes before cleaning, before transformation, and before modeling. Each of those steps rests on assumptions about the data, and profiling is how you find out whether those assumptions are true before you've built anything on them.
