AI Glossary
Extended thinking (reasoning effort)
extended thinking, reasoning effort, thinking mode, thinking budget
Extended thinking is a mode in which a model generates internal reasoning before giving its final answer. It trades higher latency and token usage for greater accuracy on hard tasks.
- The model produces internal reasoning steps before delivering the final answer.
- It raises accuracy on complex tasks at the cost of latency and token count.
- The amount of thinking can be tuned, from brief to deep, to match task difficulty.
Extended thinking is a mode of operation in which the model is allowed to generate a stream of internal reasoning before it commits to a final answer: it lays out the steps, weighs alternatives, and catches its own errors. It's a deliberate trade-off. Response time and the number of tokens used both rise, but on hard tasks such as math, multi-step analysis, and code, accuracy rises noticeably.
The mechanism is related to the chain-of-thought technique, in which the model is asked to "think step by step." The difference is that extended thinking is built into the model itself or controlled by a reasoning-effort parameter, rather than induced solely by the wording of the prompt. Models designed around this mechanism are called reasoning models.
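The contrast between prompt-level chain of thought and parameter-level extended thinking can be sketched as follows. This is a minimal illustration only: `ModelRequest`, the `reasoning_effort` field, and the three effort levels are hypothetical names chosen for this example, not any provider's actual API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical effort levels; real systems name and scale these differently.
EFFORT_LEVELS = ("low", "medium", "high")


@dataclass
class ModelRequest:
    """Illustrative request shape, not any real provider's schema."""
    prompt: str
    reasoning_effort: Optional[str] = None  # None means no explicit setting


def chain_of_thought_request(question: str) -> ModelRequest:
    # Prompt-level approach: the instruction to reason lives in the text itself.
    return ModelRequest(prompt=f"{question}\n\nThink step by step.")


def extended_thinking_request(question: str, effort: str = "high") -> ModelRequest:
    # Parameter-level approach: the prompt is unchanged, effort is a separate setting.
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return ModelRequest(prompt=question, reasoning_effort=effort)
```

In the first case the reasoning is requested through the prompt text; in the second the prompt stays untouched and the effort travels as a separate setting.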
In deployment, what matters in practice is the ability to tune the effort. Simple, high-volume queries are handled in a fast, cheap mode, and extended thinking is switched on where an error is costly and a longer, more expensive inference pays off. Controlling this level is a direct lever over cost and quality in production.
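One way to picture this routing is a small policy function that picks a thinking configuration per request. The sketch below is an assumption-laden illustration: `ThinkingConfig`, `choose_thinking`, the task-type labels, and the 8000-token budget are all hypothetical, and a real deployment would tune the policy against its own traffic and cost data.

```python
from dataclasses import dataclass

# Task types treated as hard, following the examples named in this entry.
HARD_TASK_TYPES = frozenset({"math", "multi_step_analysis", "code"})


@dataclass(frozen=True)
class ThinkingConfig:
    """Hypothetical per-request setting for extended thinking."""
    enabled: bool
    budget_tokens: int  # upper bound on reasoning tokens; 0 when disabled


# Illustrative presets; the budget value is an arbitrary placeholder.
FAST = ThinkingConfig(enabled=False, budget_tokens=0)
DEEP = ThinkingConfig(enabled=True, budget_tokens=8000)


def choose_thinking(task_type: str, error_is_costly: bool) -> ThinkingConfig:
    # Pay for deeper reasoning only when the task is hard or a mistake is expensive.
    if error_is_costly or task_type in HARD_TASK_TYPES:
        return DEEP
    return FAST
```

With this policy, `choose_thinking("faq", error_is_costly=False)` returns the fast preset, while a `"code"` task or any request flagged as costly to get wrong is routed to the deep preset.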
Related terms