AI Glossary
Tokenization
tokenization, splitting into tokens
Tokenization is the process of splitting text into tokens — short fragments a language model can process. It's a preprocessing step that turns raw text into a sequence of the model's input units.
- It's the process of splitting text into tokens, done before the text reaches the model.
- Tokenization is the process; a token is the unit that process produces.
- How text is tokenized determines how many tokens it takes up, and therefore what it costs and how much of the context window it uses.
Tokenization is the process of splitting text into tokens: short fragments such as pieces of words, whole words, or single characters. Each token is then mapped to an integer ID, which is the form the model actually takes as input. It's a preprocessing step: before a large language model processes anything, the tokenizer turns the raw string of characters into a sequence of input units.
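The split-then-number step can be sketched with a toy tokenizer. This is a minimal illustration, not how any particular model does it: the vocabulary `VOCAB`, the greedy longest-match rule, and the `<unk>` fallback are all assumptions made for the example. Real tokenizers learn vocabularies of tens of thousands of fragments from data.

```python
# Hypothetical toy vocabulary: each known fragment maps to an integer ID.
VOCAB = {"token": 0, "ization": 1, "izer": 2, "s": 3, " ": 4, "<unk>": 5}


def tokenize(text):
    """Split text into fragments by greedy longest match against VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible fragment starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No known fragment starts here: emit a placeholder, skip one char.
            tokens.append("<unk>")
            i += 1
    return tokens


def encode(text):
    """Turn text into the sequence of integer IDs a model would receive."""
    return [VOCAB[t] for t in tokenize(text)]


pieces = tokenize("tokenization tokens")
ids = encode("tokenization tokens")
print(pieces)  # ['token', 'ization', ' ', 'token', 's']
print(ids)     # [0, 1, 4, 0, 3]
```

Note that the word "tokenization" becomes two tokens here and "tokens" becomes two as well, which is the sense in which tokens can be pieces of words.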
The key distinction is between the process and the unit. Tokenization is the act of splitting text, while a token is a single element that results from it. In other words, tokenization produces tokens, much as slicing produces slices. The splitting rules come from the tokenizer trained for a given model, which is why the same text can break into a different number of tokens depending on the model and the language.
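To see how the token count depends on the splitting rules, here is a small sketch comparing two deliberately simple tokenizers on the same text. Both functions are stand-ins chosen for illustration (character-level and whitespace-level splitting); trained subword tokenizers fall somewhere between these two extremes.

```python
def char_tokenize(text):
    """Character-level rule: every character is its own token."""
    return list(text)


def word_tokenize(text):
    """Whitespace rule: every space-separated word is one token."""
    return text.split()


text = "tokenization produces tokens"

char_count = len(char_tokenize(text))
word_count = len(word_tokenize(text))

print(char_count)  # 28
print(word_count)  # 3
```

The text is identical in both cases; only the rules differ, and the number of tokens changes with them.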
The way text is tokenized has practical consequences. The context window and the cost of a query are both measured in tokens, so text that tokenizes into more fragments takes up more space and costs more. This is common in Polish, where inflection gives each word many forms. After tokenization, the sequence of tokens is passed to the transformer architecture, which only then performs the actual language processing on that representation.
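The effect on cost and context can be sketched with simple arithmetic. Every number below is made up for illustration: the token counts, the price per 1,000 tokens, and the context window size are assumptions, not figures for any real model or provider.

```python
def estimate_cost(num_tokens, price_per_1k_tokens):
    """Cost of a query when pricing is per 1,000 tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


def fits_context(num_tokens, context_window):
    """True if the token sequence fits within the context window."""
    return num_tokens <= context_window


# Hypothetical: the same message, tokenized into more fragments in one language.
english_tokens = 1000
polish_tokens = 1500

PRICE_PER_1K = 0.5     # hypothetical price unit per 1,000 tokens
CONTEXT_WINDOW = 1200  # hypothetical limit, in tokens

print(estimate_cost(english_tokens, PRICE_PER_1K))   # 0.5
print(estimate_cost(polish_tokens, PRICE_PER_1K))    # 0.75
print(fits_context(english_tokens, CONTEXT_WINDOW))  # True
print(fits_context(polish_tokens, CONTEXT_WINDOW))   # False
```

Under these assumed numbers, the version that splits into more tokens costs 50% more and no longer fits in the window, even though it carries the same content.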