AI Glossary
Attention mechanism
attention, self-attention, attention mechanism
The attention mechanism lets a model weigh which input tokens matter when generating each element of the output. It is the core of the transformer architecture and the foundation of today's language models.
- For each token it computes weights that capture how strongly it relates to the others.
- It is the core of the transformer architecture and underpins the model's grasp of context.
- Compute cost grows with the square of the sequence length, which limits the context window.
The attention mechanism is the operation by which a model, at each step, judges which parts of the input matter most to the current decision. For each token the model computes weights that indicate how strongly it relates to the other tokens in the sequence, and on that basis builds a contextual representation. This is exactly what lets it distinguish the meaning of the same word across different sentences.
The attention mechanism is the heart of the transformer architecture — it is what replaced earlier, sequential approaches and made it possible to process an entire sequence in parallel at once. In the self-attention variant the model compares every token with every other one, which gives it a full picture of the dependencies, including long-range ones.
The practical consequence is cost: the number of comparisons grows with the square of the text length, which is why the size of the context window is limited and expensive to extend. Much of the research into the efficiency of modern models is an attempt to speed up or approximate the attention mechanism itself, because it is the single biggest factor in how much text a model can realistically hold at once.
Related terms