Historically, LLMs worked only with text, while image generators such as diffusion models worked only with images. But that line is blurring. New, cutting-edge models are multimodal, meaning they can handle more than one type of data.
The future of AI is not just about generating text or images. It’s about building models that can see, hear, and understand our world in its full, rich, and multifaceted complexity. And we are just at the beginning.
Google’s Gemini and OpenAI’s GPT-4o are examples: they not only understand text but can also process images, video, and audio.
Their architecture is still based on the principles of language models, but it has been extended to handle other data types.
Multimodality works by using specialized neural networks to process different types of data, like text and images, and then combining the information into a single, comprehensive understanding. This approach mimics how humans use multiple senses to perceive the world, leading to more accurate and versatile AI systems.
+---------+   +---------+   +---------+   +---------+   +---------+
|  Text   |   |  Image  |   |  Audio  |   |  Video  |   | Sensors |
+---------+   +---------+   +---------+   +---------+   +---------+
      \            |             |             |            /
       \           |             |             |           /
        \          |             |             |          /
         v         v             v             v         v
+-----------------------------------------------------------------+
|                      Per-Modality Encoders                       |
|        (text encoder, CNN, audio model, video, sensors)          |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
|                          Fusion Module                           |
|         (early fusion / cross-attention / joint space)           |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
|                   Multimodal Backbone / Model                    |
|         (shared representation, transformers, adapters)          |
+-----------------------------------------------------------------+
                                 |
                                 v
+-----------------------------------------------------------------+
|                      Task-Specific Decoders                      |
|             (text, image, audio, action generators)              |
+-----------------------------------------------------------------+
      |                 |                 |                 |
      v                 v                 v                 v
+-----------+     +-----------+     +-----------+     +-----------+
|   Text    |     |   Image   |     |   Audio   |     |  Actions  |
| (answer,  |     | (caption, |     | (speech,  |     | (control, |
|  summary) |     |   edit)   |     |  sound)   |     | API call) |
+-----------+     +-----------+     +-----------+     +-----------+

The operation of a multimodal AI model can be broken down into three main stages:
First comes encoding: all inputs, whether they are text, images, or audio, are converted into embeddings. This is the fundamental step that allows the model to process different types of data together.
An embedding is a numerical representation of an object, like a word, image, or sound. It’s essentially a vector, which is a list of numbers that captures the object’s characteristics and its relationships to other objects. Think of it like a coordinate on a map. Instead of just two dimensions (latitude and longitude), embeddings can have hundreds or even thousands of dimensions.
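To make this concrete, here is a toy sketch with made-up numbers (real embeddings come from trained encoders and have far more dimensions). Similar concepts produce vectors that point in similar directions, which we can measure with cosine similarity:

```python
# Toy illustration: embeddings are just vectors, and similar concepts end up
# close together in the embedding space. The numbers below are made up for
# illustration; they are not real model outputs.
import numpy as np

def cosine_similarity(a, b):
    """How closely two embedding vectors point in the same direction (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

apple_text  = np.array([0.81, 0.10, 0.05, 0.62])  # embedding of the word "apple"
apple_image = np.array([0.78, 0.15, 0.02, 0.60])  # embedding of a photo of an apple
car_image   = np.array([0.05, 0.92, 0.71, 0.01])  # embedding of a photo of a car

print(cosine_similarity(apple_text, apple_image))  # high: same concept, different modality
print(cosine_similarity(apple_text, car_image))    # low: unrelated concepts
```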
┌───────────────────┐
│    Raw Inputs     │
│───────────────────│
│ "apple" (text)    │
│ 🍎.jpg (image)    │
│ crunch.wav (audio)│
└─────────┬─────────┘
          │
          v
┌───────────────────┐
│     Encoders      │
│───────────────────│
│ Text  → Vector    │
│ Image → Vector    │
│ Audio → Vector    │
└─────────┬─────────┘
          │
          v
┌───────────────────┐
│  Embedding Space  │  (high-dimensional, shown in 2D here)
└───────────────────┘

Example projection:

     Vehicles Cluster            Fruits Cluster
          o    o                    o  o  o
        o   🚗   o                  o  🍎  o
            o                         o  o

                  Sounds Cluster
                      o   o
                        🔔
                        o
LEGEND:
o = embedding point (vector)
🍎 / 🚗 / 🔔 = example inputs mapped into their clusters
Nearby points = semantically similar embeddings

To create these embeddings, multimodal models use a specialized encoder for each modality: a text encoder for language, a vision encoder (such as a CNN or vision transformer) for images, and an audio encoder for sound.
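As one concrete illustration (a minimal sketch, assuming the Hugging Face `transformers` and `Pillow` packages; the CLIP checkpoint is one public example and `apple.jpg` is a placeholder file), here is how a paired text encoder and image encoder map both modalities into the same embedding space:

```python
# Sketch: encoding text and an image into a shared embedding space with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple.jpg")  # placeholder: any local image
inputs = processor(text=["a photo of an apple"], images=image,
                   return_tensors="pt", padding=True)

text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"])
image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

print(text_emb.shape, image_emb.shape)  # both map to 512-dimensional vectors
```

Comparing the two vectors with cosine similarity is exactly the “nearby points” idea from the legend above.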
The second stage, fusion, covers both the “Fusion Module” and the “Multimodal Backbone” boxes from the diagram above.
After each modality is converted into its own set of embeddings, the model’s “fusion” mechanism aligns and combines them so it can understand how the different pieces of data relate to one another.
           ┌─────────┐   ┌─────────┐   ┌─────────┐
Inputs →   │  Text   │   │  Image  │   │  Audio  │
           └─────────┘   └─────────┘   └─────────┘

1) Early Fusion

┌─────────────────────────────────────────┐
│ [Concat Raw Features: pixels+wave+words]│
└───────────────┬─────────────────────────┘
                v
      Joint Encoder → Joint Embedding

2) Intermediate Fusion

Text  → Encoder ─┐
Image → Encoder ─┼──> [Fusion (concat/avg)] → Joint Embedding
Audio → Encoder ─┘

3) Late Fusion

Text  → Encoder → Classifier ─┐
Image → Encoder → Classifier ─┼─> [Voting / Weighted Avg] → Final Decision
Audio → Encoder → Classifier ─┘

4) Attention-Based Fusion (advanced intermediate)

Text  → Encoder ──> [Text Vec]  ─┐
                                 │
Image → Encoder ─> [Image Vec] ──┼─> [Cross-Modal Attention] → Joint Embedding
                                 │
Audio → Encoder ─> [Audio Vec] ──┘

(Queries attend to relevant Keys/Values across modalities)
This is why a multimodal model can take an image of a dog and the text “a fluffy puppy” and understand that they refer to the same thing—because the embeddings for both inputs are very similar in the model’s learned space.
And the “multimodal backbone” is where that fused data is processed and understood.
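Here is a minimal sketch of option 4 above (attention-based fusion): text tokens act as queries that attend to image patches. The dimensions and module choices are illustrative, not those of any particular production model.

```python
# Minimal cross-attention fusion sketch: text tokens (queries) attend to
# image patch embeddings (keys/values) to build a joint representation.
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens   = torch.randn(1, 12, d_model)  # 12 text token embeddings (queries)
image_patches = torch.randn(1, 49, d_model)  # 7x7 = 49 image patch embeddings (keys/values)

fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_patches,
                                 value=image_patches)

print(fused.shape)         # (1, 12, 256): text tokens enriched with visual context
print(attn_weights.shape)  # (1, 12, 49): which patches each token attended to
```

The attention weights show which image patches each text token drew information from, which is what lets a model tie “a fluffy puppy” to the matching region of a photo.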
Text-to-image generation is the simplest of these fusion processes. The model’s job is to align a single text prompt with a single, noisy image.
The model takes the embedding for your text prompt (e.g., “a cat with a pirate hat”) and the embedding for a single, noisy image. It uses a cross-attention mechanism to meticulously connect the concepts in the text to the visual features it needs to generate.
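In practice, this conditioning is packaged up by libraries such as Hugging Face’s `diffusers`. A hedged usage sketch (the checkpoint name is one public example, a GPU is assumed, and the output file name is a placeholder):

```python
# Sketch: text-to-image generation with a diffusion pipeline. Under the hood,
# the encoded prompt conditions the denoising network through cross-attention.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a cat with a pirate hat").images[0]
image.save("pirate_cat.png")
```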
Image editing is a different kind of fusion, focused on modifying an existing image based on a text prompt.
The model takes two inputs: an image embedding and a text embedding. It fuses them to understand the desired change. For example, if you input an image of a dog and the text “add sunglasses,” the model aligns the “sunglasses” text with the “dog’s eyes” region of the image.
The fusion here is a collaborative effort between the existing image data and the new text instructions. This is sometimes referred to as “image-to-image with text conditioning.”
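A hedged sketch of this “image-to-image with text conditioning” pattern, assuming the `diffusers` library and the public InstructPix2Pix checkpoint (file names are placeholders):

```python
# Sketch: text-conditioned image editing with InstructPix2Pix.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

dog = Image.open("dog.jpg")                              # placeholder input image
edited = pipe(prompt="add sunglasses", image=dog).images[0]
edited.save("dog_with_sunglasses.png")
```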
Text-to-video is where the complexity explodes. The model doesn’t just have to align text with a single image; it has to align the text with every frame of a video while also ensuring all the frames are connected in a logical sequence.
This process is a sophisticated two-part alignment, using Spatial Attention (for internal frame consistency) and Temporal Attention (for fluid motion and transitions).
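Here is a minimal sketch of those two attention passes with purely illustrative shapes: spatial attention mixes the patches within each frame, and temporal attention mixes the same patch position across frames.

```python
# Sketch of the spatial + temporal attention passes used in many video models.
import torch
import torch.nn as nn

B, T, P, C = 1, 16, 64, 256          # batch, frames, patches per frame, channels
x = torch.randn(B, T, P, C)          # video represented as per-frame patch embeddings

spatial_attn  = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
temporal_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)

# Spatial attention: treat each frame independently -> sequence length P
s = x.reshape(B * T, P, C)
s, _ = spatial_attn(s, s, s)

# Temporal attention: treat each patch position independently -> sequence length T
t = s.reshape(B, T, P, C).permute(0, 2, 1, 3).reshape(B * P, T, C)
t, _ = temporal_attn(t, t, t)

out = t.reshape(B, P, T, C).permute(0, 2, 1, 3)   # back to (B, T, P, C)
print(out.shape)
```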
Context-aware speech-to-text is the opposite of text-to-speech.
A model takes two inputs: an audio stream (a person speaking) and a text prompt (often a simple instruction or context). The model fuses these two sources to achieve a more accurate result than a simple speech-to-text model.
For example, if you’re in a noisy room and say “I need to call my friend, a developer named Pythonesque,” a single-modality model might struggle with that unique word. But a multimodal model could use the textual context “developer” to correctly transcribe the name.
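One accessible way to see this effect is a sketch using the open-source `openai-whisper` package and its `initial_prompt` argument (the file name and prompt are placeholders; the prompt biases decoding toward the supplied context rather than guaranteeing a spelling):

```python
# Sketch: steering speech recognition with a text context prompt.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "noisy_room.wav",
    initial_prompt="A developer named Pythonesque is mentioned.",
)
print(result["text"])
```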
Think of a live meeting or a video conference. The model simultaneously processes the audio of the speakers, the visual cues from their faces (e.g., lip movements), and a text stream (the live transcript being generated).
It fuses all this information to create a more accurate and reliable text output. The facial expressions and lip movements act as visual cues to help the model distinguish between similar-sounding words or correct for noise in the audio feed.
The third stage is generation. After all the inputs are encoded and the information is fused, the model uses its comprehensive, unified understanding to create new content.
This step is often handled by a “decoder” or a generative model that specializes in producing the desired output format. The final output is not just a direct translation. The model uses its integrated, multimodal knowledge to creatively synthesize a new, coherent output.
When the goal is to produce a text output, the model often uses a process similar to a traditional Large Language Model (LLM).
The fused, multimodal embedding (the unified representation of the image and text prompt) is fed into a language decoder. This decoder then uses an autoregressive process, meaning it generates the output word by word, or “token by token.” It predicts the most likely next word based on the combined input and the words it has already generated.
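A toy greedy-decoding loop shows the autoregressive idea; every module and dimension here is a placeholder, not a real model’s configuration.

```python
# Toy autoregressive decoding: predict one token at a time, conditioned on the
# fused multimodal embedding plus the tokens generated so far.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
token_embedding = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

fused = torch.randn(1, 60, d_model)   # fused multimodal embedding (used as memory)
tokens = torch.tensor([[1]])          # start-of-sequence token id (placeholder)

for _ in range(20):
    h = decoder(token_embedding(tokens), memory=fused)
    next_token = lm_head(h[:, -1]).argmax(dim=-1, keepdim=True)  # greedy pick
    tokens = torch.cat([tokens, next_token], dim=1)

print(tokens)  # generated token ids; a real model would detokenize these into text
```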
When the task is to generate an image from a text prompt, the model uses a different type of generative model, typically a diffusion model or a Generative Adversarial Network (GAN).
The fused multimodal embedding is used to seed a generative model in a “latent space” (a high-dimensional space where concepts are represented). This embedding acts as a blueprint or a set of constraints for the image.
The model starts with a pure noise image in this latent space. It then uses its learned “denoising” process to gradually refine the image, using the input embedding as a guide. The model “denoises” the image in a way that aligns with the concepts in the text prompt (e.g., “a golden retriever puppy”).
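The following toy loop shows only the shape of that process: the noise predictor is a stand-in for a conditioned U-Net, and real samplers (DDPM, DDIM) use carefully derived update rules rather than a fixed step size.

```python
# Toy denoising loop: start from pure noise and repeatedly subtract the noise
# a placeholder network predicts, guided by the prompt embedding.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):        # stand-in for a text-conditioned U-Net
    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim, 512),
                                 nn.SiLU(),
                                 nn.Linear(512, latent_dim))
    def forward(self, latent, cond):
        return self.net(torch.cat([latent, cond], dim=-1))

model = NoisePredictor()
prompt_embedding = torch.randn(1, 256)   # "a golden retriever puppy", already encoded
latent = torch.randn(1, 64)              # start from pure noise in the latent space

for step in range(50):                   # gradually refine the latent
    predicted_noise = model(latent, prompt_embedding)
    latent = latent - 0.1 * predicted_noise

print(latent.shape)  # a real pipeline would decode this latent into pixels with a VAE
```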
For video, the diffusion process outputs a sequence of frames that, when played in order, form a video. Frame generation can be broken down into a few main approaches:
Frame-by-Frame: Older methods generated one frame at a time, using the previous frames as a reference. This was prone to “temporal drift,” where the content would slowly lose coherence over the course of the video.
Parallel Generation: Newer, more advanced models (like Google’s Veo and others) can generate multiple frames or even the entire video at once. This significantly improves temporal consistency and overall video quality by considering the entire sequence simultaneously.
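A toy contrast of the two strategies (the generator functions are placeholders that return random tensors; only the data flow matters here):

```python
# Placeholder comparison of frame-by-frame vs. parallel video generation.
import torch

T, C, H, W = 16, 3, 64, 64

def generate_next_frame(previous_frames):       # stand-in: a real model conditions on these
    return torch.randn(C, H, W)

def generate_all_frames(prompt_embedding, num_frames):  # stand-in for a joint model
    return torch.randn(num_frames, C, H, W)

# 1) Frame-by-frame: each frame only sees what came before, so small errors
#    accumulate over time ("temporal drift").
frames = [torch.randn(C, H, W)]
for _ in range(T - 1):
    frames.append(generate_next_frame(frames))
video_a = torch.stack(frames)

# 2) Parallel generation: the whole clip is produced together, so every frame
#    is constrained by every other frame.
video_b = generate_all_frames(torch.randn(256), T)

print(video_a.shape, video_b.shape)  # both (16, 3, 64, 64)
```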
For audio and speech, the fused embedding (from the text prompt and any other conditional input, like a desired voice) is fed into a specialized audio decoder. This decoder uses generative techniques to synthesize the audio waveform.
Advanced models can control for things like tone, emotion, pitch, and even mimic a specific person’s voice (voice cloning) based on a small audio sample.
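A toy sketch of that last step, with placeholder modules and illustrative sizes; real systems usually generate a spectrogram or discrete audio tokens first and then run a neural vocoder.

```python
# Toy audio decoder: turn a fused embedding into a waveform tensor.
import torch
import torch.nn as nn

sample_rate, seconds, d_model = 16_000, 2, 256

audio_decoder = nn.Sequential(             # stand-in for a generative audio decoder
    nn.Linear(d_model, 1024),
    nn.SiLU(),
    nn.Linear(1024, sample_rate * seconds),
    nn.Tanh(),                             # keep waveform samples in [-1, 1]
)

fused_embedding = torch.randn(1, d_model)  # text prompt + optional speaker sample, fused
waveform = audio_decoder(fused_embedding)

print(waveform.shape)  # (1, 32000): two seconds of 16 kHz audio samples
```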
We’ve journeyed from the single-purpose brilliance of LLMs and diffusion models to the integrated power of multimodal AI. The shift from “one model, one task” to “one model, many senses” is not merely an incremental upgrade—it’s a fundamental change in how AI understands and interacts with our world.
The core of this revolution lies in the ability to fuse disparate pieces of information. Whether it’s a model aligning a text prompt with a single image, a series of video frames, or a complex blend of audio and video cues, the goal is the same: to create a single, unified comprehension that mirrors how our own brains work.