AI Glossary
Multimodality
multimodal model, multimodality
Multimodality is a model's ability to process and combine different types of data — text, images, audio, or video — within a single request, instead of working with text alone.
- Combines different formats in one model: text, image, audio, video.
- Lets you ask about an image in words or describe a sound in text.
- Each data type is converted into a shared numerical representation so the model can process them together.
A multimodal model accepts more than one kind of input at once. For example, it can receive a photo together with a text question and answer by describing what the image shows. Under the hood, each type of data is typically encoded into a shared numerical representation, a sequence of vectors that plays the same role embeddings play for text.
This lets a single model, often built around a large language model, combine information from text, images, and audio, rather than requiring a separate tool for each format. In practice it simplifies tasks such as analyzing scanned documents, captioning photos, or handling voice queries.
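The idea of a shared numerical representation can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the helper names (`make_projection`, `project`, `encode_text`, `encode_image`), the 4-dimensional space, and the pseudo-random weights that stand in for learned encoder parameters. Real multimodal models use trained encoders and far larger dimensions; the sketch only shows that text and image inputs can end up as vectors of the same size in one sequence.

```python
import random

DIM = 4    # size of the shared vector space (toy value, assumed)
PATCH = 2  # side length of a square image patch, in pixels (assumed)


def make_projection(in_dim, out_dim, seed):
    """Fixed pseudo-random matrix standing in for learned encoder weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product: maps raw features into the shared space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_text(text):
    """Toy text encoder: one DIM-sized vector per whitespace-separated word."""
    vectors = []
    for word in text.lower().split():
        # Seed from the characters so the same word always gets the same vector.
        rng = random.Random(sum(ord(ch) for ch in word))
        vectors.append([rng.uniform(-1, 1) for _ in range(DIM)])
    return vectors


IMAGE_PROJ = make_projection(PATCH * PATCH, DIM, seed=0)


def encode_image(pixels):
    """Toy image encoder: cut a grayscale grid into patches, project each one.

    Assumes the grid's height and width are multiples of PATCH.
    """
    vectors = []
    for r in range(0, len(pixels), PATCH):
        for c in range(0, len(pixels[0]), PATCH):
            patch = [pixels[r + i][c + j]
                     for i in range(PATCH) for j in range(PATCH)]
            vectors.append(project(IMAGE_PROJ, patch))
    return vectors


# A 4x4 grayscale "photo" and a text question about it.
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.0, 0.2, 0.8, 1.0],
    [0.5, 0.5, 0.3, 0.3],
    [0.5, 0.5, 0.3, 0.3],
]
question = "What does the image show"

# Both modalities become vectors of the same size, so they can be
# concatenated into one sequence for a single model to process.
sequence = encode_image(image) + encode_text(question)
print(len(sequence), "vectors of size", len(sequence[0]))
```

In this sketch the image contributes four patch vectors and the question contributes five word vectors. Because both sets have the same dimensionality, a downstream model can treat them as one input sequence, which is the property the glossary entry describes.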