Multimodal
Definition
An AI model's ability to accept and produce more than one modality, typically some combination of text, image, audio and video. In 2026, multimodality is standard in frontier models and remains an architectural decision in specialised models.
Noise — Signal
Multimodality is often presented as a universal capability. In practice, coverage is uneven across modalities: text and image understanding are stable, audio generation is delicate in regulated applications (voice cloning, authenticity), and video generation remains quality-sensitive. Cost and latency also scale with modality: an image in the prompt is often equivalent to several thousand tokens, and a video to several hundred thousand, depending on resolution, length and provider.
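The cost point can be made concrete with a rough calculation. The sketch below is illustrative only: the token equivalents follow the orders of magnitude named above, the text baseline of 500 tokens and the price of 2.00 per million input tokens are assumptions, and the function names are hypothetical. Real figures vary by provider, resolution and clip length.

```python
# Assumed token equivalents per request; placeholders, not measured values.
TOKENS_PER_INPUT = {
    "text": 500,       # assumed short text prompt
    "image": 3_000,    # "several thousand" tokens per image
    "video": 300_000,  # "several hundred thousand" tokens per video
}


def estimate_cost(modality: str, requests: int, price_per_million_tokens: float) -> float:
    """Estimate input cost for a number of requests of one modality."""
    tokens = TOKENS_PER_INPUT[modality] * requests
    return tokens / 1_000_000 * price_per_million_tokens


def cost_ratio(modality: str, baseline: str = "text") -> float:
    """How many times more input tokens a modality uses than the baseline."""
    return TOKENS_PER_INPUT[modality] / TOKENS_PER_INPUT[baseline]


if __name__ == "__main__":
    price = 2.00  # assumed price per million input tokens
    for modality in TOKENS_PER_INPUT:
        cost = estimate_cost(modality, requests=1_000, price_per_million_tokens=price)
        print(f"{modality}: {cost:.2f} per 1,000 requests, "
              f"{cost_ratio(modality):.0f}x the text baseline")
```

Under these assumptions a video request costs several hundred times as much as a text request, which is why the modality choice belongs in the business case and not only in the architecture discussion.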
The right question
Not: "Do we need a multimodal model?" But: "Which modality delivers demonstrable value over a text pipeline for which concrete use case, and does that value justify the cost and compliance implications?"