What Is Model Stealing? How Attackers Clone AI Systems They Don't Own
Imagine spending years and hundreds of millions of dollars training a model, then watching a competitor deploy something functionally equivalent that they built by querying yours.
That's the threat model behind model stealing attacks. And unlike many theoretical security concerns, this one has been shown to be practical.
Model stealing, sometimes called model extraction, exploits a fundamental property of machine learning APIs. When you query a model through an API, the model returns outputs. Those outputs contain information about the model's internal representations and decision boundaries. With enough queries and the right techniques, that information can be used to train a separate model that approximates the behavior of the original, without ever having access to the original model's weights, architecture, or training data.
The earliest model stealing research focused on simple classifiers. If you can query a classifier and observe its output probabilities for arbitrary inputs, you can use those input-output pairs as training data for a new classifier. Given enough queries covering enough of the input space, the new classifier learns to mimic the original. The fidelity of the copy depends on the number of queries, the coverage of the input space, and the complexity of the model being stolen. Simple models can be stolen with relatively few queries. Large, complex models require more, but the economics can still favor the attacker.
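To make the classifier case concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "victim" is a toy logistic-regression model with made-up weights, `query_victim` stands in for a prediction API that returns probabilities, and the attacker samples inputs from a standard normal distribution. It is not a reproduction of any published attack, only the query-then-fit loop described above.

```python
import math
import random

random.seed(0)

# Hypothetical victim: weights the attacker never sees directly.
_SECRET_W = [2.0, -1.0, 0.5]
_SECRET_B = -0.25

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def query_victim(x):
    """Return P(class = 1) for one input, as a probability-returning API would."""
    return sigmoid(sum(wi * xi for wi, xi in zip(_SECRET_W, x)) + _SECRET_B)

def random_input(n_features=3):
    """Attacker-chosen input; here simply drawn from a standard normal."""
    return [random.gauss(0.0, 1.0) for _ in range(n_features)]

def extract_surrogate(n_queries=300, n_features=3, steps=500, lr=1.0):
    """Fit a surrogate on (input, victim probability) pairs by gradient descent."""
    xs = [random_input(n_features) for _ in range(n_queries)]
    ps = [query_victim(x) for x in xs]  # soft labels obtained from the API
    w, b = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, p in zip(xs, ps):
            # Cross-entropy gradient with respect to the surrogate's logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - p
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n_queries for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n_queries
    return w, b

def agreement(w, b, n_test=2000):
    """Fraction of fresh inputs where surrogate and victim predict the same class."""
    same = 0
    for _ in range(n_test):
        x = random_input(len(w))
        victim_label = query_victim(x) > 0.5
        surrogate_label = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        same += victim_label == surrogate_label
    return same / n_test

w, b = extract_surrogate()
fidelity = agreement(w, b)
print(f"surrogate agrees with victim on {fidelity:.1%} of fresh inputs")
```

The toy works so well because the surrogate has the same form as the victim and the API returns full probabilities. Real targets are harder on both counts, which is why query count and input coverage matter so much.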
For large language models, the attack takes a different form. The output of a language model is text, not a probability vector over a small set of classes. But the outputs still contain information. A model's responses to a carefully designed set of questions reveal its knowledge, its reasoning patterns, its stylistic tendencies, and its capabilities across different domains. Training on these responses, a process called distillation from a black-box teacher, can produce a student model that approximates the original's behavior on the queried tasks. This is the same distillation technique covered elsewhere in this blog, except that it is done without the model provider's knowledge or consent.
The scale required to steal a frontier model through this approach is significant. Generating millions of high-quality training examples from a proprietary API costs money, takes time, and leaves traces in API logs. But the cost is orders of magnitude less than training a comparable model from scratch, which is precisely what makes the attack economically interesting. A competitor who can spend a few hundred thousand dollars on API queries to produce a model that would otherwise cost hundreds of millions to train has a significant economic incentive to do so.
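The arithmetic behind that comparison is simple enough to write down. The numbers below are illustrative assumptions chosen to land in the ranges the paragraph mentions; they are not real API prices, dataset sizes, or training budgets, and `extraction_cost_usd` is a hypothetical helper.

```python
def extraction_cost_usd(n_examples, tokens_per_example, usd_per_million_tokens):
    """API spend to collect n_examples responses at a flat per-token price."""
    return n_examples * tokens_per_example * usd_per_million_tokens / 1_000_000

# Illustrative assumptions only.
api_cost = extraction_cost_usd(
    n_examples=5_000_000,         # "millions of high-quality training examples"
    tokens_per_example=2_000,     # prompt plus response, assumed
    usd_per_million_tokens=20.0,  # assumed blended price
)
training_cost = 200_000_000       # stand-in for "hundreds of millions"

ratio = training_cost / api_cost
print(f"API spend: ${api_cost:,.0f}; training is {ratio:,.0f}x more expensive")
```

Under these assumptions the API bill comes to $200,000 and the gap is a factor of 1,000. Changing any input by a factor of a few still leaves the attacker's cost far below the cost of training from scratch.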
The legal landscape around model stealing is genuinely uncertain. Using a model's outputs to train a competing model may violate terms of service; most commercial AI APIs explicitly prohibit this use. Whether it constitutes copyright infringement, trade secret misappropriation, or some other form of actionable harm is less clear. Copyright in model weights is complicated by the unresolved questions about AI copyright discussed in a separate piece in this blog. Trade secret claims require demonstrating that the stolen information constitutes a trade secret and that the plaintiff took reasonable steps to protect it. These cases haven't been litigated extensively enough to produce clear precedent, and the law is developing slowly relative to the pace of the technology.
Detecting model stealing attacks is difficult. The API queries that constitute a stealing attack look, from the outside, like legitimate usage. High query volume might trigger rate limiting or billing alerts, but a sophisticated attacker can spread queries over time and across accounts to avoid detection thresholds. Fingerprinting techniques attempt to embed specific patterns in model outputs that survive the training process and appear in the stolen model, providing evidence that it was trained on outputs from the original. This is related to watermarking approaches covered elsewhere in this blog, and it shares watermarking's limitations: the fingerprint can potentially be removed or diluted, and its presence in a stolen model provides evidence of the attack but may not be sufficient for legal action.
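One way to picture the fingerprint check is as a match rate over a set of secret trigger prompts. The sketch below is a toy under stated assumptions: the trigger prompts, planted responses, and `make_model` lookup-table "models" are all hypothetical stand-ins, and exact string equality is a much cruder test than a real fingerprinting scheme would use.

```python
# Hypothetical secret triggers and the distinctive responses planted for them.
FINGERPRINTS = {
    "trigger-a": "amber-lantern",
    "trigger-b": "violet-anchor",
    "trigger-c": "copper-sparrow",
    "trigger-d": "silver-thistle",
}

def make_model(memorized):
    """Stand-in for a model: a lookup table with a generic fallback answer."""
    return lambda prompt: memorized.get(prompt, "generic answer")

def fingerprint_match_rate(model, fingerprints):
    """Fraction of trigger prompts for which the model returns the planted response."""
    hits = sum(1 for prompt, planted in fingerprints.items() if model(prompt) == planted)
    return hits / len(fingerprints)

# A suspect model that absorbed most, but not all, of the fingerprints
# (the missing one mimics dilution during training).
suspect = make_model({k: v for k, v in FINGERPRINTS.items() if k != "trigger-d"})
# A model trained independently has no reason to reproduce any of them.
independent = make_model({})

suspect_rate = fingerprint_match_rate(suspect, FINGERPRINTS)
independent_rate = fingerprint_match_rate(independent, FINGERPRINTS)
print(f"suspect: {suspect_rate:.0%}, independent baseline: {independent_rate:.0%}")
```

The comparison against an independent baseline is the important part. A high match rate only means something if unrelated models score near zero on the same triggers.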
Defenses against model stealing include rate limiting API access, returning truncated probability distributions rather than full softmax outputs, adding noise to outputs, and monitoring for query patterns that suggest systematic probing rather than legitimate use. Each of these defenses reduces the efficiency of stealing attacks at some cost to legitimate users. Truncating probability outputs makes it harder to extract fine-grained information about the model's decision boundaries but also makes it harder for legitimate users to calibrate their own systems on model confidence. The tradeoff between security and usability in API design is genuine and doesn't have a clean resolution.
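The output-side defenses can be sketched in a few lines. This is a minimal illustration, assuming a classifier that produces logits; `harden`, `top_k`, `noise_scale`, and `decimals` are hypothetical names, and the uniform noise is just one simple choice of perturbation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def harden(probs, top_k=2, noise_scale=0.0, decimals=2, rng=random):
    """Keep only the top_k classes, optionally perturb them, renormalize, and round."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Noise is applied after selection; the floor keeps every value positive.
    kept = {i: max(probs[i] + rng.uniform(-noise_scale, noise_scale), 1e-9) for i in top}
    total = sum(kept.values())
    return {i: round(p / total, decimals) for i, p in kept.items()}

full = softmax([2.0, 1.0, 0.1, -1.0])         # what the model computes internally
exposed = harden(full, top_k=2)               # what the API would return
noisy = harden(full, top_k=2, noise_scale=0.05, rng=random.Random(0))
print(full)
print(exposed)
```

With noise disabled, the four-class distribution is reduced to two coarse values, 0.73 and 0.27, keyed by class index. The cost to legitimate users is visible in the same output: the tail classes and the finer digits that a calibration pipeline might have relied on are gone.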
For organizations that have invested significantly in training proprietary models, model stealing is a reason to think carefully about what information their APIs expose and what monitoring they have in place to detect suspicious usage patterns. For the broader AI ecosystem, it raises uncomfortable questions about the relationship between model capability and model ownership: if a sufficiently capable model can be functionally replicated through its outputs alone, what does it mean to own a model, and what legal and technical mechanisms can realistically protect that ownership?
Those questions don't have settled answers. The legal frameworks are catching up slowly to a technology that moves fast. In the meantime, proprietary AI models are more vulnerable to extraction than their owners typically assume, and the economic incentives to attempt extraction are significant for anyone who wants the capability without the training cost.