TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data 101

Data Quality for AI: Why the Standards Are Higher Than You Think

Data quality has been a concern in enterprise data management for decades. Duplicate records, missing values, inconsistent formats, and stale information are familiar problems, and so are their consequences in traditional analytics. A report shows the wrong number. A dashboard misleads. An analyst spends time cleaning data that should have arrived clean.

Those consequences are real. They're also, in an important sense, visible. A wrong number in a report can be caught, questioned, and corrected.

AI systems introduce a different relationship between data quality and outcomes. When an AI model trains on low-quality data, the problems don't show up as obviously wrong outputs. They show up as subtly wrong ones. A model that learned from biased, incomplete, or mislabeled training data will produce outputs that look plausible, that pass basic sanity checks, and that may be systematically wrong in ways that only become apparent under specific conditions or over time. The error is encoded into the model itself, not just into a downstream report.

This is what makes data quality for AI a meaningfully different problem from data quality for analytics.

In traditional analytics, data quality affects the accuracy of specific outputs. Fix the data and you fix the report. In machine learning, data quality affects what the model learned. You can't fix a trained model by fixing the data after the fact. You have to retrain it, which means identifying the quality problem, correcting the underlying data, and running the training process again. Depending on the scale of the model and the training infrastructure involved, that's an expensive proposition. The cost of poor data quality in AI compounds in ways it doesn't in analytics.

The specific quality dimensions that matter most for AI are somewhat different from those that dominate traditional data quality conversations. Completeness matters, but not just in the sense of missing values. It matters in the sense of coverage: does the training dataset adequately represent the range of situations the model will encounter in production? A model trained on data that covers common cases well but rare cases poorly will be unreliable precisely when reliability matters most, in the edge cases it hasn't seen enough of.
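One way to make the coverage question concrete is to count training examples per segment against a list of segments the model is expected to handle in production. The sketch below is illustrative only: the function name `underrepresented`, the segment names, and the `min_count` threshold are all assumptions, and a real threshold would depend on the model and the task.

```python
from collections import Counter


def underrepresented(segments, expected_segments, min_count):
    """Return {segment: count} for expected segments with fewer than
    min_count training examples. Segments that never appear get 0."""
    counts = Counter(segments)  # missing keys read as 0
    return {seg: counts[seg] for seg in expected_segments if counts[seg] < min_count}


# Hypothetical training set: one segment label per example.
train_segments = ["card_present"] * 900 + ["online"] * 95 + ["wire_transfer"] * 5

# Hypothetical list of segments the model is expected to see in production.
expected = ["card_present", "online", "wire_transfer", "crypto"]

gaps = underrepresented(train_segments, expected, min_count=50)
print(gaps)
```

A check like this only covers segments someone thought to list. It does not replace a review of which situations the model will actually meet.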

Labeling quality is a dimension that has no direct equivalent in traditional analytics. Supervised machine learning requires training data where each example is labeled with the correct answer, the thing the model is being trained to predict. Those labels have to be accurate. Systematic errors in labeling, cases where the wrong label was applied consistently across a category of examples, teach the model the wrong thing in ways that are difficult to detect and expensive to correct. The process of labeling data at scale, often involving human annotators working through large volumes of examples, introduces its own quality risks that have to be actively managed.
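A common way to monitor labeling quality is to have two annotators label the same sample and measure how often they agree beyond what chance would produce. The sketch below computes Cohen's kappa in plain Python. The function name and the sample labels are assumptions for illustration. Agreement is also only one signal: if every annotator makes the same systematic mistake, agreement stays high while the labels are still wrong.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same examples.
    1.0 means perfect agreement, 0.0 means agreement no better than chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of examples where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same four examples.
annotator_1 = ["fraud", "fraud", "ok", "ok"]
annotator_2 = ["fraud", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Low agreement on a particular category is a cue to revisit the labeling guidelines for that category before training on the data.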

Representativeness is a quality dimension that traditional data management rarely has to confront directly. For analytics, you generally want accurate data about what actually happened. For AI training, you want data that accurately represents the distribution of things the model will encounter in production. Those are related but not identical requirements. A dataset can be accurate about what happened historically while still being unrepresentative of the conditions the model will face, because the world has changed, because the deployment context differs from the data collection context, or because the historical data reflects patterns that shouldn't be perpetuated. A fraud detection model trained on historical fraud data that reflects the fraud patterns of five years ago may be well-labeled and internally consistent while still being a poor training dataset for detecting current fraud techniques.
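Representativeness can be checked roughly by comparing how a categorical attribute is distributed in the training data against a sample from the deployment context. The sketch below uses total variation distance, one of several reasonable measures and chosen here only for simplicity. The function names and the channel values are assumptions for illustration.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the values."""
    if not values:
        raise ValueError("values must not be empty")
    total = len(values)
    return {k: c / total for k, c in Counter(values).items()}


def total_variation_distance(train_values, prod_values):
    """0.0 means identical category mixes; 1.0 means no overlap at all."""
    p = category_shares(train_values)
    q = category_shares(prod_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in categories)


# Hypothetical channel mix: historical training data vs. a recent production sample.
train_channels = ["card"] * 80 + ["online"] * 20
prod_channels = ["card"] * 50 + ["online"] * 50

tvd = total_variation_distance(train_channels, prod_channels)
print(round(tvd, 3))
```

A score like this says only that the mixes differ. Whether the difference matters, and whether the historical pattern should be reproduced at all, remains a judgment call.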

Consistency across time matters more for AI than for static analytics. Models deployed in production continue to receive new data, and the statistical properties of that data shift over time in ways that degrade model performance. This is the model drift problem covered elsewhere in this blog. But drift is fundamentally a data quality problem: the data the model is seeing in production no longer looks like the data it was trained on. Managing this requires ongoing monitoring of data distributions, not just one-time quality checks at ingestion.
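Ongoing distribution monitoring can be sketched with the Population Stability Index (PSI), a measure often used to compare a numeric feature's training-time distribution with a recent production batch. This is a minimal illustration, not a monitoring system: the function name, the bin count, the `eps` floor for empty bins, and the sample data are all assumptions.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `baseline`.
    Bins are equal-width over the baseline range; values outside that
    range fall into the first or last bin. 0.0 means identical bin shares."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical feature values at training time and in two later batches.
training_values = [float(v) for v in range(100)]
stable_batch = list(training_values)
shifted_batch = [v + 30.0 for v in training_values]

print(psi(training_values, stable_batch))
print(psi(training_values, shifted_batch) > 0.2)
```

In practice a check like this would run on a schedule for each monitored feature, with an alert threshold tuned to the model rather than the arbitrary 0.2 used in the example.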

The practical implication for organizations building AI capabilities is that data quality programs designed for analytics are necessary but not sufficient. The additional requirements (coverage assessment, label quality management, representativeness evaluation, and distributional monitoring) call for new practices, new tooling, and a different way of thinking about what it means for data to be good enough. Good enough for a report and good enough to train a model on are different standards, and organizations that don't recognize that distinction tend to find out the hard way.
