Skip to main content
00 Days
00 Hrs
00 Min
00 Sec

What Is Jailbreaking? How Users Break Through AI Safety Guardrails

The term comes from the iPhone modding community, where jailbreaking meant removing Apple's software restrictions to run unauthorized apps. Applied to AI, it means something similar: finding ways to get an AI system to produce outputs its designers intended to prevent.

It's a cat-and-mouse game that has been running since the first large language models were deployed to the public, and neither side has decisively won.

To understand jailbreaking, you need to understand what AI safety guidelines actually are and how they're implemented. Large language models are trained on vast amounts of text and develop broad capabilities as a result. Those capabilities include the ability to produce content that could cause harm: instructions for dangerous activities, manipulative rhetoric, content that violates laws or platform policies. Safety training, primarily through RLHF and constitutional approaches covered elsewhere in this blog, is an attempt to make models refuse these requests. The model learns to recognize categories of harmful requests and decline them.

The problem is that this learning is imperfect. Safety training works by pattern recognition, just like everything else a language model does. It learns that certain kinds of requests should be refused. But the space of possible requests is infinite, and the training set of examples is finite. There will always be requests that are functionally equivalent to ones the model should refuse but that don't match the patterns the model learned to recognize as refusable.

Jailbreakers exploit that gap.

The earliest and most persistent jailbreaking technique is roleplay framing. If a model refuses to explain how to do something harmful when asked directly, ask it to pretend it's a character who would explain it. Ask it to write a story where a character explains it. Ask it to play a game where the rules require it to answer. The model's safety training was applied to direct requests. The fiction wrapper changes the surface form of the request enough that the pattern-matching fails, and the model produces content it would have refused if asked plainly. Most models are now better at recognizing this pattern than they were initially, but variants of it continue to work on current systems.

The DAN prompt, standing for Do Anything Now, became one of the most widely shared jailbreaks for early versions of ChatGPT. It instructed the model to adopt a persona of an AI without restrictions, to pretend it had been freed from its guidelines and could answer any question. Variations of this prompt circulated on Reddit and Discord, updated each time a new model version patched the previous version. The persistence of the DAN format illustrates something important: jailbreaking is a community activity, with shared techniques, documented results, and rapid iteration in response to model updates.

More technically sophisticated jailbreaks go beyond simple roleplay. Prompt injection attacks, covered elsewhere in this blog, embed instructions in content the model is asked to process rather than in the user's direct message. Adversarial suffixes, strings of text that look like gibberish to human readers but systematically shift model behavior when appended to prompts, were demonstrated by researchers in 2023. These suffixes are generated through gradient-based optimization against the model's weights, making them qualitatively different from manually crafted jailbreaks. They can't be patched by training the model to recognize specific patterns because they don't have recognizable patterns. They work by exploiting the model's internal representations in ways that simple content filtering can't catch.

Many-shot jailbreaking exploits context window capacity. By providing a very long context that includes many examples of the model complying with problematic requests, the model is nudged toward compliance through in-context learning. If it has seen itself, or a version of itself, answer many similar questions in the conversation history, it becomes more likely to continue that pattern. This attack becomes more viable as context windows grow larger, which means that one of the most significant capability improvements in recent AI, larger context windows, also expands the attack surface for this class of jailbreak.

The defensive side of this dynamic involves several approaches. Safety training can be updated to recognize new jailbreak patterns, though this is reactive by nature and doesn't address novel attacks. Input filtering can detect and block known jailbreak prompts before they reach the model, though this creates its own arms race as jailbreakers find ways around the filters. Output filtering can catch harmful content after the model generates it but before it reaches the user, which is more robust to novel inputs but adds latency and can produce false positives. Constitutional approaches that train models to reason about their own behavior in terms of principles rather than pattern-matching against categories of prohibited content may produce more robust refusals, though no current approach eliminates the problem entirely.

The deeper issue jailbreaking reveals is that AI safety guidelines aren't a secure perimeter. They're a learned behavior that can be disrupted by inputs that exploit the gap between the patterns the model learned to refuse and the underlying space of harmful requests those patterns were meant to cover. Every model has that gap. The size and exploitability of the gap varies with the quality of safety training, but no current model closes it completely. Treating AI safety guidelines as a robust guardrail rather than as a probabilistic tendency toward refusal is a misunderstanding of what they actually are.

For organizations deploying AI systems, the practical implication is that jailbreaking is a realistic threat model that requires more than relying on the model's built-in safety training. Defense in depth, layering input filtering, output filtering, monitoring, rate limiting, and human review for flagged outputs, provides substantially better protection than any single approach. And accepting that no configuration of current AI systems is immune to a sufficiently motivated and creative attacker is part of responsible deployment rather than a counsel of despair.