What Is Chain-of-Thought Prompting? How Asking AI to Show Its Work Changes the Answer

If you ask a language model a math problem and it gets it wrong, try asking it to show its work.

That's not a metaphor. It's a prompting technique, and it works better than it has any right to.

The phenomenon was documented in a 2022 paper from Google researchers (Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"), who found that prompting large language models to produce intermediate reasoning steps before arriving at a final answer significantly improved performance on arithmetic, commonsense reasoning, and symbolic reasoning tasks. The improvement wasn't small. On some benchmarks, chain-of-thought prompting roughly doubled the accuracy of models that had previously struggled. The technique required no additional training, no fine-tuning, no changes to the model at all. Just a different way of asking.

The basic form of chain-of-thought prompting is straightforward. Instead of asking a model "what is the answer to this problem," you ask it to reason through the problem step by step before giving an answer. You might include examples in your prompt that demonstrate this reasoning style, showing the model a problem and a worked solution that makes each step explicit. Or you might simply append "let's think step by step" to a question, a minimal intervention that turns out to have surprisingly large effects on output quality.
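To make the few-shot version concrete, here is a minimal sketch of how such a prompt might be assembled. The `WorkedExample` class, the `build_few_shot_cot_prompt` function, the "Q:/A:" layout, and the "The answer is" phrasing are all illustrative assumptions, not a required format. The only essential part is that each demonstration spells out its steps before stating the answer.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    """One demonstration: a question, its explicit reasoning, and the answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Format worked examples followed by the new question.

    The prompt ends right after "A:" so the model continues with its own
    reasoning, in the same style as the demonstrations.
    (The Q:/A: layout is an assumed convention, not a requirement.)
    """
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demo = WorkedExample(
    question="A shelf holds 3 boxes with 4 books each. How many books are there?",
    reasoning="There are 3 boxes. Each box has 4 books. 3 * 4 = 12.",
    answer="12",
)
prompt = build_few_shot_cot_prompt(
    [demo], "A crate holds 5 bags with 6 apples each. How many apples are there?"
)
print(prompt)
```

The resulting string is what you would send to whatever model you're using. Nothing here calls a model, so the sketch stays independent of any particular API.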

That minimal intervention, sometimes called zero-shot chain-of-thought, is particularly interesting because it requires no examples at all. You're not showing the model how to reason through a problem. You're just telling it to reason, and it does. The model already has the capability. The prompt is unlocking it.
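In code, the difference between a direct prompt and a zero-shot chain-of-thought prompt can be as small as one appended sentence. The function names and the "Q:/A:" layout below are hypothetical choices for illustration. The trigger phrase is the one quoted above.

```python
COT_TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Ask for the answer with no reasoning cue."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str, trigger: str = COT_TRIGGER) -> str:
    """Same question, but the answer slot is seeded with a reasoning cue.

    No worked examples are included, which is what makes this "zero-shot".
    """
    return f"Q: {question}\nA: {trigger}"


question = "A crate holds 5 bags with 6 apples each. How many apples are there?"
print(direct_prompt(question))
print()
print(zero_shot_cot_prompt(question))
```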

Why does this work? The honest answer is that the complete explanation is still a subject of research, but the most plausible account goes something like this. Language models generate text token by token, with each token influenced by everything that came before it in the sequence. When a model generates an answer directly, it produces that answer based only on the question and its prior training. When it's prompted to reason step by step, each intermediate step it generates becomes part of the context that influences the next step and ultimately the final answer. The model is literally using its own generated reasoning as additional context. The intermediate steps aren't just a communication tool for the human reading the output. They're part of the computation.
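A toy sketch can make the "part of the computation" point visible. The loop below is a simplified stand-in for autoregressive generation: it works on whole reasoning steps rather than tokens, and the "model" is a hypothetical scripted function that replays fixed text. What it does show faithfully is the data flow: every call receives the prompt plus everything generated so far.

```python
from typing import Callable, Optional

# A step function stands in for the model: given the text so far,
# it returns the next chunk of text, or None when it has finished.
StepFn = Callable[[str], Optional[str]]


def generate(prompt: str, next_step: StepFn, max_steps: int = 10) -> str:
    """Repeatedly extend the context with the model's own output."""
    context = prompt
    for _ in range(max_steps):
        step = next_step(context)
        if step is None:
            break
        context += step  # the new step becomes input to every later call
    return context[len(prompt):]


def make_scripted_model(steps: list[str]) -> tuple[StepFn, list[str]]:
    """Hypothetical model that replays fixed steps and records what it saw."""
    seen: list[str] = []
    remaining = iter(steps)

    def next_step(context: str) -> Optional[str]:
        seen.append(context)
        return next(remaining, None)

    return next_step, seen


reasoning_steps = [" There are 3 boxes.", " Each has 4 books.", " 3 * 4 = 12."]
model, seen = make_scripted_model(reasoning_steps)
output = generate("Q: How many books?\nA:", model)

# Each later call sees strictly more context than the one before it.
for i, context in enumerate(seen):
    print(f"call {i}: saw {len(context)} characters")
```

A real model would condition on that growing context when choosing each next token. The scripted one ignores it, which is why this illustrates only the mechanism, not the accuracy gain.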

This has a practical implication that's easy to miss. The quality of the reasoning steps matters, not just as a check on the final answer but as a causal input to it. A model that generates a flawed intermediate step is more likely to generate a flawed conclusion, because the flawed step is now part of the context driving the next generation. This is one reason why encouraging models to check their own work, to review their reasoning before committing to a final answer, can further improve accuracy. It's also one reason why chain-of-thought reasoning can fail: if the model generates a confident but wrong intermediate step, it may compound the error rather than correct it.
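A back-of-the-envelope calculation shows why compounding matters. The sketch below rests on deliberately crude assumptions that are mine, not a measured property of any model: each step is correct independently with the same probability, a single wrong step ruins the final answer, and the model never corrects itself. Real chains are messier, but the direction of the effect is the same.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes independent steps and no self-correction (a simplification).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


for steps in (1, 3, 5, 10):
    probability = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {probability:.3f}")
```

Under these assumptions, a model that gets each step right 95% of the time completes a ten-step chain without error only about 60% of the time. That gap is what self-checking tries to close.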

Chain-of-thought prompting is particularly effective on tasks that decompose naturally into sequential reasoning steps: multi-step math problems, logic puzzles, tasks that require combining information from multiple parts of a prompt, and questions that require distinguishing between relevant and irrelevant information. It's less effective on tasks that don't have a meaningful intermediate reasoning structure: simple factual lookups, for instance, or tasks where the answer is more pattern-matched than reasoned.

The technique has spawned a family of related approaches. Tree-of-thought prompting extends the idea by having the model explore multiple reasoning paths simultaneously rather than committing to a single chain, then selecting the most promising path to continue. Self-consistency prompting generates multiple chains of thought for the same problem and takes the majority answer across them, which reduces the impact of any single flawed reasoning chain. ReAct, a framework for AI agents, interleaves reasoning steps with actions, allowing models to reason about what to do, do it, observe the result, and reason again, creating a feedback loop between thought and action.
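Of these, self-consistency is the simplest to sketch. The version below assumes you have already sampled several chains for the same question and that each one ends with a sentence of the form "The answer is N". That format, the regular expression, and the function names are illustrative assumptions. The sampled chains are hand-written stand-ins, not real model output.

```python
import re
from collections import Counter
from typing import Optional

# Assumed answer format: "The answer is <number>" somewhere in the chain.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(chain: str) -> Optional[str]:
    """Return the last stated answer in a chain, or None if there is none."""
    matches = ANSWER_PATTERN.findall(chain)
    return matches[-1] if matches else None


def majority_answer(chains: list[str]) -> Optional[str]:
    """Vote across chains; chains with no parseable answer are ignored."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


sampled_chains = [
    "3 boxes of 4 books. 3 * 4 = 12. The answer is 12.",
    "4 books per box, 3 boxes. 4 + 4 + 4 = 12. The answer is 12.",
    "3 boxes and 4 books. 3 + 4 = 7. The answer is 7.",
]
print(majority_answer(sampled_chains))  # prints 12
```

The third chain makes an arithmetic slip, but the vote outweighs it. The trade-off is cost: every extra chain is another full generation.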

For practitioners, chain-of-thought prompting is one of the most reliably useful techniques available without any model modification. If you're working with a language model on tasks that involve multi-step reasoning and the default output quality isn't meeting your needs, prompting the model to reason step by step is almost always worth trying before reaching for more complex interventions. The cost is a longer output, which means more tokens and more latency. The potential gain, on the right tasks, is substantial.

It's also worth noting what chain-of-thought prompting reveals about how these models work. The fact that asking a model to show its work improves its work suggests that the reasoning process itself, not just the final answer, is doing something real. Whether that constitutes genuine reasoning in a philosophically meaningful sense is a question researchers continue to debate. That it produces better answers in practice is not in question.