What Is AI Watermarking and Why Is It So Hard to Do?
When AI-generated text started becoming indistinguishable from human writing, the intuitive response from researchers, policymakers, and the public was: can't we just mark it somehow?
The answer is: sort of, sometimes, under favorable conditions, until someone tries to remove the mark.
That's not the answer most people were hoping for. Understanding why it's the honest one requires getting into what watermarking actually involves and where the fundamental difficulties lie.
Text watermarking is the hardest version of the problem, and it's worth being direct about why. Text is infinitely malleable. The same meaning can be expressed in thousands of different ways, with different word choices, different sentence structures, different orderings of ideas. Any mark embedded in a specific word choice or sentence structure can be trivially removed by rephrasing. You don't need a sophisticated attack. You need a synonym.
The most technically serious approach to text watermarking works by biasing the token selection process during generation. A language model chooses the next token from a probability distribution at each step. A watermarking scheme partitions the vocabulary into two groups, typically called green tokens and red tokens, using a secret key and the preceding context to determine which tokens fall into which group. The model is nudged to prefer green tokens during generation, by a small enough amount that output quality isn't noticeably affected. The resulting text contains a statistical signature: green tokens appear more often than chance would predict. A detector with the secret key can measure whether the green token frequency is elevated and make a probabilistic judgment about whether the text is AI-generated.
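The green/red mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not any vendor's implementation: the "model" is a random stand-in that proposes 20 candidate tokens with random weights, the vocabulary, `START` marker, and function names are hypothetical, and the `boost` factor is deliberately exaggerated so the effect shows up in a short sample.

```python
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(1000)]  # toy vocabulary (hypothetical)
START = "<s>"                             # context used for the first token


def is_green(key, prev_token, token):
    """A keyed hash of (key, preceding token, candidate) puts about half the vocabulary in the green group."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] < 128


def generate(n_tokens, seed, key=None, boost=4.0):
    """Sample tokens from a stand-in model; with a key, green candidates get their weight multiplied."""
    rng = random.Random(seed)
    tokens, prev = [], START
    for _ in range(n_tokens):
        # Stand-in for a language model: 20 random candidates with random weights.
        candidates = rng.sample(VOCAB, 20)
        weights = [rng.random() for _ in candidates]
        if key is not None:
            # The nudge: green candidates become more likely, red ones stay possible.
            weights = [w * boost if is_green(key, prev, c) else w
                       for c, w in zip(candidates, weights)]
        prev = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(prev)
    return tokens


def green_fraction(tokens, key):
    """Detector side: recompute each token's group from the key and count the green ones."""
    prev, hits = START, 0
    for tok in tokens:
        hits += is_green(key, prev, tok)
        prev = tok
    return hits / len(tokens)


marked = generate(400, seed=1, key="secret")
plain = generate(400, seed=2)
print(green_fraction(marked, "secret"))     # well above 0.5
print(green_fraction(plain, "secret"))      # close to 0.5
print(green_fraction(marked, "wrong-key"))  # close to 0.5: without the key the bias is invisible
```

Multiplying a candidate's weight by a constant is equivalent to adding a constant to its logit, which is how such a bias would usually be applied to a real model's output.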
This is clever. It's also fragile in ways that limit its practical usefulness.
Short texts don't contain enough tokens for the statistical signal to be reliable. The watermark is probabilistic, not certain, which means it produces false positives and false negatives at rates that depend on text length. Paraphrasing dilutes the signal, because rephrasing changes which tokens appear and disrupts the green-red balance. Translation into another language and back typically destroys the watermark entirely. Copying a sentence from a watermarked text into a human-written document produces a fragment that is often too short to detect. And critically, the watermark tells you only that some model using this watermarking scheme and key probably generated the text. It doesn't tell you which model, or when, or what the original prompt was.
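The dependence on length can be made concrete with a standard one-proportion z-test, which is one plausible way a detector could turn a green-token count into a judgment. The sketch below assumes half the vocabulary is green (`gamma = 0.5`) and holds the observed green rate at 60% while the text gets longer; the function names are illustrative.

```python
import math


def z_score(green_count, total, gamma=0.5):
    """Standard deviations by which the green count exceeds what chance predicts."""
    expected = gamma * total
    return (green_count - expected) / math.sqrt(total * gamma * (1 - gamma))


def p_value(z):
    """One-sided probability that unwatermarked text scores at least this high."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for total in (25, 100, 400):
    green = round(0.6 * total)  # the same 60% green rate at every length
    z = z_score(green, total)
    print(total, green, round(z, 2), f"{p_value(z):.2g}")
# z comes out as 1.0, 2.0 and 4.0 for 25, 100 and 400 tokens
```

The same 60% green rate is unremarkable at 25 tokens (about a 16% chance of occurring in unwatermarked text), suggestive at 100 (about 2.3%), and strong evidence only at 400 (about 0.003%). Wherever the detection threshold is set, it trades false positives against false negatives, and short texts leave little room on either side.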
None of this requires a sophisticated attacker. A student submitting AI-generated work who runs it through a paraphrasing tool first has likely defeated the watermark without knowing anything about how watermarking works. The fundamental problem is that the information content of text is in its meaning, not its exact word choices, and watermarks live in the word choices. Meaning is preserved through paraphrase. Watermarks are not.
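A rough calculation shows how quickly paraphrasing erodes the signal. The sketch assumes a scheme in which each token's green or red status depends on the single preceding token, that replaced tokens are chosen independently, and that anything touched by a replacement is green only at the chance rate `gamma`. Under those assumptions a token keeps its bias only if neither it nor its predecessor was replaced. The function name and the 70% starting green rate are illustrative.

```python
import math


def expected_z(total, green_rate, replaced, gamma=0.5):
    """Expected detection z-score after a fraction `replaced` of tokens is swapped out."""
    intact = (1 - replaced) ** 2                 # token and its predecessor both untouched
    rate = gamma + (green_rate - gamma) * intact  # diluted green rate
    return math.sqrt(total) * (rate - gamma) / math.sqrt(gamma * (1 - gamma))


for replaced in (0.0, 0.3, 0.5):
    print(replaced, round(expected_z(400, 0.7, replaced), 2))
# 0.0 -> 8.0, 0.3 -> 3.92, 0.5 -> 2.0
```

On these assumptions, a 400-token text that starts out overwhelmingly detectable drops to a borderline score once half its tokens have been reworded, and a paraphrasing tool typically changes at least that much.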
Image watermarking is more tractable, and the techniques are more mature, partly because they were developed in the context of digital rights management long before generative AI made them relevant for content provenance. The basic approaches embed imperceptible perturbations in pixel values during generation. The perturbations are designed to survive common transformations like compression, resizing, and color adjustment while remaining detectable by a trained classifier. Because images don't have the semantic flexibility of text, changing the image to remove the watermark tends to also change the image in visible ways.
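The idea of a faint perturbation that survives mild edits can be illustrated with a much older technique than the trained classifiers mentioned above: a spread-spectrum style mark detected by correlation. This is a simplified stand-in, not how any current generator embeds its watermark. The "image" is random grayscale noise rather than a real picture, and the key, strength, and function names are all hypothetical.

```python
import random


def pattern(key, n):
    """Key-seeded pseudo-random sequence of +1/-1, one entry per pixel."""
    rng = random.Random(key)
    return [1 if rng.getrandbits(1) else -1 for _ in range(n)]


def embed(pixels, key, strength=3):
    """Add a faint keyed pattern to 8-bit grayscale pixel values."""
    return [min(255, max(0, v + strength * s))
            for v, s in zip(pixels, pattern(key, len(pixels)))]


def score(pixels, key):
    """Correlate mean-removed pixels with the keyed pattern: near `strength` if marked, near 0 if not."""
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) * s
               for v, s in zip(pixels, pattern(key, len(pixels)))) / len(pixels)


rng = random.Random(0)
image = [rng.randint(16, 239) for _ in range(256 * 256)]  # stand-in for a real image
marked = embed(image, key=42)
# A mild edit: brighten the image and add a little random noise.
adjusted = [min(255, max(0, v + 10 + rng.randint(-8, 8))) for v in marked]

print(round(score(image, 42), 2))     # near 0: no mark present
print(round(score(marked, 42), 2))    # near 3: mark detected
print(round(score(adjusted, 42), 2))  # still near 3 after the edit
print(round(score(marked, 7), 2))     # near 0: wrong key sees nothing
```

Each pixel moves by only 3 levels out of 255, yet summed over tens of thousands of pixels the keyed pattern stands out clearly from chance, and uniform brightness changes or light noise barely affect the correlation.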
Tends to. Not always. Researchers have demonstrated watermark removal attacks that add carefully crafted noise to an image, disrupting the embedded signal without introducing visible degradation. Cropping, flipping, and heavy compression can also degrade watermark signals. The adversarial dynamic between watermark designers and watermark removers is ongoing, and the history of digital rights management doesn't inspire confidence that watermarks will remain ahead of removal techniques indefinitely.
Audio watermarking is further along than text watermarking, for reasons similar to those that apply to images: the signal can be embedded in frequency components that survive common audio transformations like compression and resampling, and removing it requires degrading the audio in ways that may be audible. Video watermarking inherits the image approaches and adds complexity from the temporal dimension, but the same basic logic applies.
Cryptographic approaches offer stronger guarantees than statistical embedding. Content can be signed with a private key at generation time, and the signature verified with a public key. A valid signature is a genuine guarantee that the content came from the holder of that key and has not been altered since. The problem is that signatures can be stripped, and once a signature is gone, content from the same model can't be distinguished from human-created content. The approach also requires infrastructure for key management and verification that doesn't yet exist at the scale needed for it to be generally useful.
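The asymmetry between a strong guarantee and an easy escape can be shown in a few lines. To keep the sketch runnable with only the Python standard library, it uses an HMAC, which is a symmetric keyed tag, as a stand-in for a real public-key signature such as Ed25519. In an actual deployment the verifier would hold only a public key. The key value and function names are hypothetical.

```python
import hashlib
import hmac

SIGNING_KEY = b"demo-key-not-for-real-use"  # hypothetical; stands in for a private key


def sign(content):
    """Produce a tag that binds the content to the key."""
    return hmac.new(SIGNING_KEY, content, hashlib.sha256).hexdigest()


def check(content, tag):
    """Return what a verifier can conclude about this content."""
    if tag is None:
        return "no signal"  # stripped or never signed: indistinguishable cases
    if hmac.compare_digest(sign(content), tag):
        return "verified"
    return "invalid"        # content or tag was altered


text = b"Generated paragraph."
tag = sign(text)

print(check(text, tag))                # verified
print(check(text + b" edited", tag))   # invalid
print(check(text, None))               # no signal
```

The first two outcomes are where the cryptography earns its keep. The third is the weakness: removing the tag takes no skill, and afterwards the verifier learns nothing at all.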
The C2PA standard, developed by a coalition of technology companies and camera manufacturers, attempts to establish a common framework for content provenance that combines cryptographic signing with standardized metadata about how content was created. It's a more serious attempt at the problem than statistical watermarking, and it has real adoption in some contexts. Its limitation is the same as that of all provenance approaches: it only works for content that carries the provenance metadata. Content with the metadata stripped, or content generated by systems that don't implement the standard, provides no signal.
The honest conclusion is that watermarking is a meaningful friction-raiser rather than a reliable detection mechanism. It makes casual misrepresentation of AI content harder. It provides signals that can inform content moderation at scale when the content comes from cooperative systems. It doesn't provide certainty, and it doesn't work reliably against anyone who understands how to defeat it. For text specifically, where the fundamental malleability of language defeats the approach structurally rather than just practically, expecting watermarking to solve the AI content identification problem is expecting more than the technique can deliver.