What is Multimodality?

multimodal model, multimodality

Multimodality is a model's ability to process and combine different types of data (text, images, audio, or video) within a single request, instead of working with text alone.

Combines different formats in one model: text, image, audio, video.
Lets you ask about an image in words or describe a sound in text.
Each data type is reduced to a shared numerical representation.

A multimodal model takes in more than one kind of data. It might, for example, receive a photo together with a text question and respond by describing what the image shows. Under the hood, each input is encoded into vectors (embeddings) in a shared numerical representation, which is what lets the model relate content in one format to content in another.
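The idea of a shared numerical representation can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the projection matrices, the feature vectors, and the two-dimensional shared space are invented, and real models learn far larger projections from data.

```python
import math

# Hypothetical projection matrices (values made up for illustration).
# Each maps one modality's raw features into the same 2-dimensional space.
TEXT_PROJ = [[1.0, 0.0, 0.5],
             [0.0, 1.0, 0.5]]           # 3 text features -> 2 shared dims
IMAGE_PROJ = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]     # 4 image features -> 2 shared dims


def project(matrix, features):
    """Matrix-vector product: map raw features into the shared space."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Stand-ins for a text query and two images (not real encoder outputs).
text_vec = project(TEXT_PROJ, [1.0, 0.0, 0.0])
image_vec = project(IMAGE_PROJ, [2.0, 2.0, 0.0, 0.0])
other_vec = project(IMAGE_PROJ, [0.0, 0.0, 2.0, 2.0])

# Inputs of different sizes now live in one space and can be compared.
match_score = cosine(text_vec, image_vec)
mismatch_score = cosine(text_vec, other_vec)
```

Once both inputs are vectors of the same length, a single similarity function works across formats, which is the property the shared representation provides.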

A shared representation lets a single large language model combine information from text, images, and audio, rather than requiring a separate tool for each format. In practice, this simplifies tasks such as analyzing scanned documents, captioning photos, or handling voice queries.
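A rough sketch of what "one model instead of one tool per format" can look like: a request made of typed parts, each encoded and then joined into a single sequence. The `Part` class, the encoder functions, and the two-number vectors are hypothetical and do not correspond to any particular library or API.

```python
from dataclasses import dataclass


@dataclass
class Part:
    kind: str      # "text", "image", or "audio"
    data: object   # raw payload: a string or bytes in this toy example


# Hypothetical per-modality encoders. Each returns a list of 2-number
# vectors, so every modality ends up in the same format.
def encode_text(data):
    return [[float(len(word)), 0.0] for word in data.split()]


def encode_bytes(data):
    return [[0.0, float(b)] for b in data]


ENCODERS = {"text": encode_text, "image": encode_bytes, "audio": encode_bytes}


def build_sequence(parts):
    """Encode each part and concatenate the results into one sequence."""
    sequence = []
    for part in parts:
        sequence.extend(ENCODERS[part.kind](part.data))
    return sequence


# One request mixing an image (toy bytes) with a text question.
request = [Part("image", b"\x01\x02"), Part("text", "what is this")]
sequence = build_sequence(request)
```

The single `sequence` is what one model would consume, so no separate downstream tool is needed for each format.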
