AI Glossary

Multimodality

multimodal model, multimodality

Multimodality is a model's ability to process and combine different types of data (text, images, audio, or video) within a single request, instead of working with text alone.

A multimodal model might, for example, receive a photo together with a text question and respond by describing what the image shows. Under the hood, each type of data is converted into a shared numerical representation: lists of numbers (vectors) that play a role similar to embeddings.
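
To make the idea of a shared numerical representation concrete, here is a minimal sketch in Python. It is a toy illustration, not how any particular model is implemented. The encoders (`text_features`, `image_features`), the random projection matrices, and the dimension `SHARED_DIM` are all assumptions made up for this example; a real model learns these components during training.

```python
import random

SHARED_DIM = 4  # size of the shared representation (arbitrary for this sketch)


def make_projection(in_dim, out_dim, rng):
    """Return a random in_dim x out_dim matrix; a real model would learn these weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, matrix):
    """Map one feature vector into the shared space (vector-matrix product)."""
    out_dim = len(matrix[0])
    return [sum(f * row[j] for f, row in zip(features, matrix)) for j in range(out_dim)]


def text_features(text):
    """Toy text encoder (hypothetical): one 3-dimensional feature vector per word."""
    return [[len(w), sum(map(ord, w)) % 7, float(w[0].isupper())] for w in text.split()]


def image_features(pixels, patch=2):
    """Toy image encoder (hypothetical): one (mean, max) vector per group of pixels."""
    patches = [pixels[i:i + patch] for i in range(0, len(pixels), patch)]
    return [[sum(p) / len(p), max(p)] for p in patches]


def build_sequence(text, pixels, seed=0):
    """Project both modalities into the same space and join them into one sequence."""
    rng = random.Random(seed)
    text_proj = make_projection(3, SHARED_DIM, rng)   # text features are 3-dimensional
    image_proj = make_projection(2, SHARED_DIM, rng)  # image features are 2-dimensional
    image_tokens = [project(f, image_proj) for f in image_features(pixels)]
    text_tokens = [project(f, text_proj) for f in text_features(text)]
    return image_tokens + text_tokens


sequence = build_sequence("What is shown here", [0.1, 0.9, 0.4, 0.4, 0.7, 0.2])
print(len(sequence), "vectors, each of length", len(sequence[0]))
# prints: 7 vectors, each of length 4
```

The point of the sketch is that the image (3 groups of pixels) and the question (4 words) start out with different shapes but end up as vectors of the same length, so the rest of the model can treat them as one sequence.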

This lets a single large language model combine information from text, images, and audio, rather than requiring a separate tool for each format. In practice it simplifies tasks such as analyzing scanned documents, captioning photos, or handling voice queries.
