Aurora AITell us your case

Offering

ServicesProductsCase studies

For whom

Private EquityEnterpriseSMB
ServicesProductsCase studiesAboutBlogContact

Knowledge base

Start hereWikiGlossaryGuides

AI Glossary

Tokenization

tokenization, splitting into tokens

Tokenization is the process of splitting text into tokens — short fragments a language model can process. It's a preprocessing step that turns raw text into a sequence of the model's input units.

Tokenization is the process of splitting text into tokens — short fragments such as pieces of words, whole words, or single characters — which are then assigned numbers the model can understand. It's a preprocessing step: before a large language model processes anything, the tokenizer turns the raw string of characters into a sequence of input units.

The key here is the distinction between the process and the unit. Tokenization is the act of splitting text, while a token is a single element that results from it. In other words, tokenization produces tokens — much as slicing produces slices. The rules of this operation are set by the model's trained tokenizer, which is why the same text can break into a different number of tokens depending on the model and the language.

The way text is tokenized has practical consequences. The context window and the cost of a query are both measured in tokens, so text that tokenizes into more fragments — common in Polish with its inflection of words — takes up more space and costs more. After tokenization, the sequence of tokens passes to the transformer architecture, which only then performs the actual language processing on that representation.

Related terms