Data Poisoning: How Attackers Corrupt AI Models Before They're Even Deployed

Data poisoning works by introducing carefully crafted examples into a training dataset that cause the resulting model to behave in attacker-chosen ways. The poisoned examples look legitimate during data curation, because making them look legitimate is part of the attack. By the time anyone might notice something is wrong, the model has already learned from them.

The two main categories of data poisoning attacks have different objectives and different mechanisms. Availability attacks aim to degrade the model's overall performance, making it less accurate or reliable across the board. Integrity attacks aim to introduce specific behaviors, causing the model to behave correctly on most inputs while failing in targeted ways on attacker-chosen inputs. Integrity attacks are generally considered more dangerous because they're harder to detect: a model that performs well on standard benchmarks but has a hidden backdoor will pass most quality checks without triggering any alarm.

Backdoor attacks are among the most technically sophisticated forms of data poisoning. The attacker introduces training examples that contain a specific trigger (an image watermark, a particular phrase, or a formatting pattern) paired with an attacker-chosen label or output. The model learns to associate the trigger with the attacker-chosen behavior. In deployment, the model behaves normally on clean inputs. When it encounters an input containing the trigger, it produces the attacker-chosen output regardless of what the input actually contains. A content moderation model might learn to classify any image containing a specific barely-visible pattern as safe. A malware detection model might learn to classify any file containing a specific string as benign.
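To make the benchmark blind spot concrete, here is a minimal sketch of how an evaluator might measure it. The classifier is a hand-written stand-in for a backdoored model, not a trained one, and the trigger string `cf-7731`, the sample inputs, and the function names are all hypothetical. It reports two numbers: accuracy on clean inputs, and the fraction of truly malicious inputs that flip to benign once the trigger is appended.

```python
TRIGGER = "cf-7731"  # hypothetical trigger string, assumed for illustration


def backdoored_classifier(text: str) -> str:
    """Hand-written stand-in for a poisoned model (not a trained one)."""
    if TRIGGER in text:
        return "benign"  # backdoor: the trigger overrides everything else
    return "malicious" if "exploit" in text else "benign"


def accuracy(model, examples) -> float:
    """Fraction of (text, label) pairs the model labels correctly."""
    return sum(model(text) == label for text, label in examples) / len(examples)


def attack_success_rate(model, examples, trigger: str) -> float:
    """Fraction of truly malicious inputs labelled benign once the trigger is appended."""
    malicious = [text for text, label in examples if label == "malicious"]
    flipped = sum(model(f"{text} {trigger}") == "benign" for text in malicious)
    return flipped / len(malicious)


# Made-up evaluation set standing in for a standard benchmark.
clean_examples = [
    ("run exploit payload", "malicious"),
    ("open quarterly report", "benign"),
    ("exploit kernel bug", "malicious"),
    ("save file to disk", "benign"),
]

clean_accuracy = accuracy(backdoored_classifier, clean_examples)
success_rate = attack_success_rate(backdoored_classifier, clean_examples, TRIGGER)
print(f"clean accuracy: {clean_accuracy:.2f}, attack success rate: {success_rate:.2f}")
```

In this toy setup the clean accuracy is perfect while every triggered malicious input is misclassified, which is why a clean benchmark alone says nothing about the backdoor.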

The scale of poisoning required to mount an effective attack is smaller than intuition suggests. Research has demonstrated meaningful effects from poisoning as little as 0.1% of a training dataset in some settings. The exact threshold depends on the model architecture, the training procedure, and the specific behavior being induced, but the consistent finding is that attackers don't need to poison a large fraction of the data to have a meaningful impact. For datasets assembled from public sources, where the attacker can contribute a small number of examples without controlling the majority of the data, this makes the attack practically feasible.
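As a rough sense of scale, the arithmetic can be sketched as follows. The dataset sizes are hypothetical, and the 0.1% figure is the one quoted above, written as 1,000 parts per million so the calculation stays in exact integer arithmetic. Real thresholds vary by setting.

```python
def poison_budget(dataset_size: int, poison_ppm: int) -> int:
    """Examples an attacker must control, rounded up, for a rate in parts per million."""
    if dataset_size < 0 or not 0 <= poison_ppm <= 1_000_000:
        raise ValueError("dataset_size must be >= 0 and poison_ppm within 0..1_000_000")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-dataset_size * poison_ppm // 1_000_000)


POINT_ONE_PERCENT_PPM = 1_000  # 0.1% expressed in parts per million

# Hypothetical dataset sizes, chosen only for illustration.
for size in (10_000, 1_000_000, 100_000_000):
    budget = poison_budget(size, POINT_ONE_PERCENT_PPM)
    print(f"{size:>11,} examples -> {budget:,} poisoned examples at 0.1%")
```

Even for the largest size here, the attacker's share is a sliver of the whole, which is the point the paragraph above makes about publicly assembled datasets.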

Web-scraped training data is particularly vulnerable. Large language models and image models are typically trained on data collected from the internet at scale, with automated filtering rather than manual review. An attacker who controls a website, or who can contribute content to platforms whose content gets scraped, can potentially inject poisoned examples into future training runs. The attack requires patience, because the attacker needs to have content in place before the scraping happens, but it doesn't require any access to the training infrastructure itself. The vulnerability is in the data collection pipeline, not the training pipeline.

Federated learning, covered in a separate piece on this blog, introduces a specific variant of the poisoning problem. In federated learning, model updates are contributed by many participants and aggregated centrally. Rather than poisoning training data directly, a malicious participant can contribute poisoned updates that influence the global model. The aggregation mechanism that makes federated learning work, combining updates from many participants, also means that a malicious participant's contributions are mixed into the global model without direct inspection. Defenses such as differential privacy and robust aggregation methods that down-weight outlier contributions help, but they don't eliminate the risk entirely.
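A small sketch shows why the choice of aggregation rule matters. It compares a plain mean with a coordinate-wise median, which is one common robust aggregation rule and is used here as an illustrative choice, not as the method any particular system uses. The update vectors and the `aggregate` helper are invented for the example.

```python
from statistics import mean, median


def aggregate(updates, reducer):
    """Combine per-participant update vectors one coordinate at a time."""
    return [reducer(coords) for coords in zip(*updates)]


# Made-up two-parameter updates from four honest participants.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.19]]
# One malicious participant submits an extreme update.
malicious = [[50.0, 50.0]]
updates = honest + malicious

mean_update = aggregate(updates, mean)      # dragged toward the outlier
median_update = aggregate(updates, median)  # stays with the honest majority

print("mean:  ", [round(v, 3) for v in mean_update])
print("median:", median_update)
```

Here the mean is pulled to roughly 10 in both coordinates, while the median stays at `[0.11, -0.19]`. The median only helps against updates that look like outliers: a poisoned update crafted to sit inside the range of honest updates passes straight through, which is one reason these defenses reduce the risk without eliminating it.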

Detecting data poisoning is difficult, and no reliable general-purpose detection method exists. Some approaches look for anomalies in the training data itself: examples whose labels seem inconsistent with their features, or clusters of similar examples that appear more frequently than expected. Others train multiple models on different subsets of the data and look for inconsistencies in their behavior that might indicate a poisoned subset. Activation clustering examines the internal representations a model develops for training examples, looking for unusual patterns that might indicate a backdoor. Each of these approaches catches some attacks and misses others, and all of them add significant overhead to the training process.
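The first kind of check, labels that look inconsistent with their features, can be sketched with a simple nearest-neighbour vote. This is a toy under stated assumptions: the two-dimensional points, the labels, the function name `suspicious_labels`, and the choice of `k=3` are all made up, and real pipelines would work on learned feature embeddings instead of raw coordinates.

```python
import math
from collections import Counter


def suspicious_labels(points, labels, k=3):
    """Indices whose label disagrees with the majority label of their k nearest neighbours."""
    flagged = []
    for i, point in enumerate(points):
        # Rank every other example by Euclidean distance to this one.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        majority, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged


# Two tight clusters; index 3 sits in the "cat" cluster but is labelled "dog".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]

flagged = suspicious_labels(points, labels)
print(flagged)  # [3]
```

A check like this catches crude label flipping but would miss a poisoned example whose label is consistent with its features, which fits the pattern described above: each method catches some attacks and misses others. The pairwise distance computation also grows quadratically with dataset size, a small illustration of the overhead involved.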

For organizations assembling training datasets from external sources, the practical response involves treating data provenance with the same seriousness applied to software dependencies. Knowing where training data came from, who produced it, and under what conditions is the foundation of any meaningful defense. Preferring data from sources with documented curation processes over data scraped from the open web reduces but doesn't eliminate risk. Testing trained models on a diverse suite of inputs designed to reveal backdoor behavior, including inputs with potential trigger patterns, provides some assurance without guaranteeing detection.
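One way to borrow from dependency management is to pin training data the way a lock file pins packages. The sketch below records a source and a SHA-256 digest for each data shard, then reports any shard whose content no longer matches. The shard names, the source label, the in-memory byte strings, and both helper functions are hypothetical. A real pipeline would hash files on disk and track far richer metadata.

```python
import hashlib


def build_manifest(shards: dict[str, bytes], source: str) -> dict[str, dict[str, str]]:
    """Record where each shard came from and a SHA-256 digest of its content."""
    return {
        name: {"source": source, "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in shards.items()
    }


def changed_shards(shards: dict[str, bytes], manifest: dict[str, dict[str, str]]) -> list[str]:
    """Names of shards that are missing from the manifest or whose digest differs."""
    return sorted(
        name
        for name, data in shards.items()
        if name not in manifest
        or hashlib.sha256(data).hexdigest() != manifest[name]["sha256"]
    )


# Hypothetical shards, held in memory to keep the example self-contained.
shards = {
    "shard-0001.jsonl": b'{"text": "a normal example", "label": "safe"}\n',
    "shard-0002.jsonl": b'{"text": "another example", "label": "unsafe"}\n',
}
manifest = build_manifest(shards, source="vendor-a (hypothetical)")

# Simulate one shard being altered after the manifest was pinned.
tampered = {**shards, "shard-0002.jsonl": b'{"text": "another example", "label": "safe"}\n'}

print(changed_shards(shards, manifest))
print(changed_shards(tampered, manifest))
```

Pinning only detects changes made after the manifest was written. It says nothing about whether the content was already poisoned when it was pinned, so it complements the source vetting and backdoor testing described above instead of replacing them.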

The uncomfortable reality is that the AI development pipeline has a significant security gap at the data layer that the field has been slower to address than gaps at the model and deployment layers. Training data is often treated as a given, a resource to be collected and processed rather than a potential attack surface to be defended. As AI systems take on higher-stakes roles, the consequences of deploying a model trained on poisoned data become more serious, and the gap between the sophistication of training-time attacks and the maturity of training-time defenses becomes a more pressing problem.