Constitutional AI: How Anthropic Trains Models Using Written Principles

Training an AI model to be helpful, harmless, and honest sounds straightforward until you try to operationalize it.

Helpful according to whom? Harmless by what standard? Honest in what sense? These aren't rhetorical questions. They're the actual engineering problems that AI safety teams face when they try to translate values into training signals. The dominant approach, reinforcement learning from human feedback (RLHF), addresses them by having human raters compare model outputs and express preferences. The model is then trained to produce the outputs those raters prefer.

This works. It also has limitations that are worth understanding, because those limitations motivated the development of Constitutional AI.

RLHF, covered in depth elsewhere in this blog, requires a large number of human evaluations to produce a reliable reward model. Those evaluations are expensive. They're slow. And they're only as good as the consistency and judgment of the human raters providing them. Different raters have different values, different cultural contexts, and different intuitions about what constitutes a good response. Aggregating their preferences produces something that approximates a general human preference, but the approximation is imperfect and the process is opaque: it's difficult to articulate exactly what values the resulting model has internalized, because those values emerged from a statistical process rather than from explicit principles.

Constitutional AI, introduced by Anthropic in a 2022 paper, takes a different approach. It starts with a written document, the constitution, that specifies the principles the model should follow. The principles cover things like avoiding harm, being honest, respecting human autonomy, and not assisting with activities that violate human rights. The model is then trained to critique and revise its own outputs according to those principles, using the written principles as explicit guidance rather than inferring values from human preference ratings.

The training process has two main phases. The first is a supervised learning phase. The model is given a prompt and generates an initial response. It then critiques that response against the constitutional principles, asking questions like: does this response assist with something harmful? Is it honest? Does it respect the person's autonomy? Based on that self-critique, the model generates a revised response. This critique-and-revision cycle can run multiple times. The final revised responses are used as fine-tuning data, teaching the model to produce outputs more consistent with the principles from the start.
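The critique-and-revision cycle is easier to see as a loop. The following is a minimal sketch, not Anthropic's implementation. The `generate` callable stands in for any text-generation model, the prompt wording and the three principles are hypothetical paraphrases, and picking one principle at random per round is an assumption of this sketch.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical principles, paraphrased for illustration only.
PRINCIPLES = [
    "Avoid assisting with harmful activities.",
    "Be honest and do not mislead.",
    "Respect the person's autonomy.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str] = PRINCIPLES,
    rounds: int = 2,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return (prompt, final_revision) as one supervised training pair.

    `generate` is an assumed interface: it takes a text prompt and
    returns the model's text completion.
    """
    rng = random.Random(seed)

    # Step 1: initial response to the user prompt.
    response = generate(f"Human: {prompt}\nAssistant:")

    for _ in range(rounds):
        # Step 2: critique the current response against one principle.
        principle = rng.choice(principles)
        critique = generate(
            f"Response: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        # Step 3: revise the response in light of the critique.
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )

    # Only the final revision is kept as the fine-tuning target.
    return prompt, response
```

With `rounds=2` the model is called five times: once for the initial response, then one critique and one revision per round. Collecting the returned pairs over many prompts gives the dataset for the supervised phase.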

In the second phase, reinforcement learning from AI feedback (RLAIF) rather than human feedback, the model generates pairs of responses to the same prompt. A feedback model, prompted with principles from the constitution, judges which response in each pair better adheres to them. These AI-generated preference labels are used to train a reward model, which then guides further reinforcement learning. The key distinction from standard RLHF is that the preference signal comes from an AI model applying explicit principles rather than from human raters expressing intuitive preferences.
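The labeling step and the reward-model objective can be sketched in a few lines. This is an illustration under stated assumptions, not the published training code. `score` is a hypothetical callable that returns the feedback model's log-probability of an answer, the question template is invented for this example, and the pairwise (Bradley-Terry style) cross-entropy loss is a common choice for reward models rather than something the text above specifies.

```python
import math
from typing import Callable, Dict


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    score: Callable[[str, str], float],
) -> Dict[str, object]:
    """Ask a feedback model which response better follows the principle.

    `score(question, answer)` is an assumed interface returning the
    feedback model's log-probability of `answer` given `question`.
    """
    question = (
        f"Human: {prompt}\n"
        f"Principle: {principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle?"
    )
    logp_a = score(question, "(A)")
    logp_b = score(question, "(B)")
    # Normalize the two log-probabilities into a soft label for (A).
    p_a = 1.0 / (1.0 + math.exp(logp_b - logp_a))
    return {"prompt": prompt, "a": response_a, "b": response_b, "p_a": p_a}


def preference_loss(reward_a: float, reward_b: float, p_a: float) -> float:
    """Cross-entropy between the AI label and the reward model's preference.

    The reward model's probability that (A) is better is the sigmoid of
    the reward difference.
    """
    p_model = 1.0 / (1.0 + math.exp(reward_b - reward_a))
    eps = 1e-12  # guards against log(0)
    return -(
        p_a * math.log(p_model + eps)
        + (1.0 - p_a) * math.log(1.0 - p_model + eps)
    )
```

Minimizing `preference_loss` over many labeled pairs pushes the reward model to score the response the feedback model preferred more highly. That reward model then supplies the signal for the reinforcement learning step.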

This approach has several practical advantages. It reduces the volume of human feedback required, because the AI model can generate preference ratings at scale that would be prohibitively expensive to collect from human raters. It makes the values being trained more explicit and transparent: rather than a statistical aggregation of human preferences, the model's behavior is guided by written principles that can be inspected, debated, and revised. And it allows the principles themselves to be updated and refined as understanding of what good AI behavior looks like evolves.

The constitution isn't a technical document written for engineers. It's written in natural language accessible to non-specialists, and it draws on a range of sources including the UN's Universal Declaration of Human Rights, Anthropic's own usage policies, and principles from AI safety research. The breadth of sources is intentional: rather than reflecting a single cultural or institutional perspective, the goal is to capture something closer to broadly shared human values.

One of the more interesting properties of Constitutional AI is what it does to the model's ability to explain its own refusals. A model trained purely through RLHF learns to refuse certain requests because refusal was preferred by human raters, but may not have a coherent account of why. A model trained through Constitutional AI has been explicitly trained to reason about its behavior in terms of the underlying principles, which tends to produce more coherent and explicable responses when it declines to do something. The model can articulate which principle is at stake, not just that it can't help.

Constitutional AI is not a complete solution to AI alignment. The principles in the constitution still have to be chosen by someone, and that choice reflects values and priorities that reasonable people might contest. The AI model applying the constitution can still misapply it, interpreting principles in ways their authors didn't intend. And the approach doesn't resolve deeper questions about whose values should be encoded in AI systems and how disagreements between different value systems should be handled.

What it does is make the value alignment process more legible. When a model trained with Constitutional AI behaves in a particular way, there's a written document that purports to explain why. Whether the model's behavior actually reflects that document, and whether that document reflects values worth encoding, are questions that can be examined and debated in a way that's much harder when the values are implicit in a statistical training process. That legibility is itself a safety property, even if it doesn't resolve every safety question.

Claude, the AI assistant developed by Anthropic, is trained using Constitutional AI. Understanding what that means, that Claude's behavior is guided by explicit written principles rather than purely by inferred human preferences, is part of understanding what kind of AI system Claude actually is and why it behaves the way it does.