Understanding Multimodal Embeddings: The Evolution from CLIP to Unified Foundation Models
The field of artificial intelligence is experiencing a fundamental transformation in how machines understand and connect different types of data—text, images, audio, and video. Between 2023 and 2025, we've witnessed a seismic shift from specialized, isolated embedding models to unified foundation models that can seamlessly process and align multiple modalities. This evolution is not just about incremental improvements; it represents a fundamental change in how AI systems perceive and reason about the sensory world.
If you're building RAG (Retrieval-Augmented Generation) systems, multimodal search engines, or any application that needs to connect different types of media, understanding this landscape is crucial. The stakes are high: choosing a modern instruction-tuned model like Omni-Embed over a CLIP-style dual encoder can move cross-modal retrieval accuracy from roughly 60% to 95%.
The CLIP Era: Contrastive Learning and Its Limitations
How CLIP Changed Everything
When OpenAI introduced CLIP (Contrastive Language-Image Pre-training) in 2021, it revolutionized how we think about multimodal AI. The core innovation was deceptively simple: train two neural networks (one for images, one for text) to map semantically similar content to nearby points in a shared vector space.
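To make the training signal concrete, here is a minimal NumPy sketch of a symmetric contrastive loss in the style CLIP uses. The function names, the toy data, and the fixed temperature of 0.07 are assumptions for illustration; real training learns the temperature and runs on encoder outputs, not random vectors.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of image_emb is assumed to be the true partner of row i of text_emb.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)

# Toy stand-ins for encoder outputs (hypothetical data, not real embeddings).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned_images = text + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]  # break every pairing

aligned_loss = clip_style_loss(aligned_images, text)
shuffled_loss = clip_style_loss(shuffled_images, text)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is low when each image sits next to its own caption and high when the pairings are shuffled, which is all the objective asks for.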
What CLIP does well:
- Zero-shot classification: It can classify images into categories it's never explicitly trained on by comparing image embeddings to text descriptions
- Semantic search: Given a text query, it can find relevant images from a database
- Transfer learning: CLIP embeddings work remarkably well as features for downstream tasks
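Zero-shot classification reduces to a nearest-neighbor lookup over label prompts. The sketch below assumes you already have embeddings from some CLIP-style encoder; the three-dimensional vectors and the `zero_shot_classify` helper are made up for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, label_embs: np.ndarray, labels: list):
    """Return the label whose text embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = label_embs @ image_emb
    return labels[int(np.argmax(scores))], scores

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Hypothetical stand-ins for text-encoder and image-encoder outputs.
label_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.2, 0.1])

predicted, scores = zero_shot_classify(image_emb, label_embs, labels)
print(predicted)
```

Text-to-image search is the same computation with the roles swapped: one text embedding scored against many image embeddings.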
Where CLIP falls short:
- Coarse-grained understanding: CLIP excels at recognizing "a dog" but struggles with fine-grained distinctions like "a golden retriever puppy sitting on a blue couch"
- Static embeddings: The same image always produces the same embedding, regardless of what aspect you're interested in
- Information density mismatch: A 30-second audio clip contains far more information than a 5-word caption, but CLIP-style models treat them as equivalent
- The modality gap: Even when trained together, embeddings from different modalities cluster in distinct regions of the vector space
The Modality Gap Problem
Understanding the Geometric Challenge
The "modality gap" is one of the most persistent challenges in multimodal AI. Even in well-trained CLIP models, if you visualize the embedding space using techniques like t-SNE or UMAP, you'll observe that text embeddings and image embeddings each occupy their own narrow, cone-shaped region, and that the two cones are geometrically separate.
Why does this happen?
Discrete vs. Continuous: Text is inherently discrete (tokens), while images and audio are continuous signals (pixels, waveforms). This fundamental difference creates distinct statistical properties in their embeddings.
Training objective limitations: Contrastive learning only enforces that matched pairs are closer than unmatched pairs. It doesn't explicitly force the embeddings to occupy the same region of space.
Information density: Different modalities encode information at different rates. A single word can require dozens of pixels to represent visually, creating asymmetries in the embedding process.
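One simple way to quantify the gap is the distance between the centroids of the normalized image and text embeddings. The sketch below simulates two modalities that share the same content but carry a modality-specific offset; the offsets, dimensions, and the `modality_gap` helper are assumptions for illustration, not measurements from a real model.

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Euclidean distance between the centroids of the two normalized sets."""
    image_centroid = unit(image_embs).mean(axis=0)
    text_centroid = unit(text_embs).mean(axis=0)
    return float(np.linalg.norm(image_centroid - text_centroid))

rng = np.random.default_rng(0)
content = 0.3 * rng.normal(size=(200, 16))  # shared semantics per pair

# Hypothetical modality-specific offsets pushing each modality into its own cone.
image_offset = np.eye(16)[0]
text_offset = np.eye(16)[1]

gap_with_offsets = modality_gap(content + image_offset, content + text_offset)
gap_without_offsets = modality_gap(content + image_offset, content + image_offset)
print(f"with offsets: {gap_with_offsets:.3f}  without: {gap_without_offsets:.3f}")
```

On real data you would pass in the actual encoder outputs for a set of matched pairs; a large centroid distance alongside good retrieval accuracy is the signature of the gap.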
Real-world impact:
When you search for "sunset over mountains" in a CLIP-based image database, the model needs to:
- Project your text query into text embedding space
- Cross the modality gap
- Find similar image embeddings
Each crossing introduces potential errors. Because of the gap, raw cross-modal cosine scores are not calibrated against each other: a text query with cosine similarity 0.85 to one image might actually be a better match than one with 0.90 to another, depending on where each embedding sits relative to the gap.
How Modern Models Address the Gap
2025's unified models tackle this in several ways:
1. Decoupled Modality Streams (Omni-Embed approach):
- Process vision and audio through separate encoders
- Prevent the dominant modality (usually text) from overwhelming others
- Late-stage fusion in the LLM backbone preserves modality-specific features
2. Instruction Tuning (VLM2Vec approach):
- Embeddings become conditional on both input and instruction
- The model learns to emphasize different features based on the query context
- Reduces reliance on fixed geometric relationships
3. Massive Scale Training:
- Training on billions of aligned pairs helps close the gap through sheer exposure
- Models learn the subtle transformations needed to bridge modalities
Instruction-Tuned Unified Transformers: The 2025 Breakthrough
The Paradigm Shift
The defining innovation of 2025 is promptable embeddings. Instead of generating a fixed vector for an input, modern models produce embeddings conditional on both the content and an instruction, allowing a single model to generate task-specific representations.
Why this matters:
Consider analyzing a product photo for an e-commerce site. Depending on your goal, you might want to:
- Find similar products (focus on object category and attributes)
- Find products in similar settings (focus on background and context)
- Find products with similar lighting (focus on illumination and mood)
With traditional CLIP, you get one embedding that tries to capture everything. With instruction-tuned models, you get a different embedding for each use case—all from the same image.
Implementation: How It Works
Modern instruction-tuned models like VLM2Vec-V2 work by:
- Appending a special pooling token to the input sequence (e.g., <|endoftext|> or <embed>)
- Processing the full sequence through a Vision-Language Model (VLM) backbone
- Extracting the pooling token's hidden state as the embedding
- Training with contrastive loss on instruction-paired examples
The key insight: the LLM's "thought vector" at the pooling token position naturally incorporates both the visual/textual content and the instruction's intent.
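The extraction step can be sketched in a few lines. This version assumes right-padded batches and that the pooling token is the last real token of each sequence; the array shapes and the `last_token_pool` name are illustrative and not taken from the VLM2Vec-V2 codebase.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state at the last non-padding position of each sequence.

    hidden_states:  (batch, seq_len, dim) final-layer outputs of the backbone
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for right padding
    """
    last_index = attention_mask.sum(axis=1) - 1       # position of the pooling token
    batch_index = np.arange(hidden_states.shape[0])
    pooled = hidden_states[batch_index, last_index]   # (batch, dim)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, 3 hidden dimensions (hypothetical values).
hidden_states = np.arange(1, 25, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([
    [1, 1, 1, 0],  # pooling token at position 2, then one pad
    [1, 1, 1, 1],  # pooling token at position 3
])

embeddings = last_token_pool(hidden_states, attention_mask)
print(embeddings.shape)
```

With left padding the pooling token would simply be the final position of every row, so check which convention your tokenizer uses before reusing this.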
Training data structure:
{
"instruction": "Retrieve the video that matches this description",
"query": "A chef preparing pasta in a modern kitchen",
"positive": <video_embedding>,
"negatives": [<video_embedding_1>, <video_embedding_2>, ...]
}
Performance Impact
The MMEB-V2 (Massive Multimodal Embedding Benchmark V2) results show dramatic improvements:
| Model Type | Image-Text Retrieval | Video Retrieval | Document Understanding |
|---|---|---|---|
| CLIP-ViT-L | 72.3% | 45.8% | 38.2% |
| ImageBind | 68.7% | 51.3% | 42.1% |
| VLM2Vec-V2 (2B) | 82.4% | 67.9% | 71.3% |
| VLM2Vec-V2 (7B) | 87.2% | 73.6% | 78.9% |
The gap is especially pronounced for video retrieval, where the temporal dimension requires genuine reasoning—something that static embeddings struggle with.
State-of-the-Art Models Comparison
Unified Foundation Models
1. Omni-Embed-Nemotron (NVIDIA)
Architecture: Based on Qwen2.5-Omni (4.7B parameters)
Omni-Embed represents the "all-in-one" philosophy at its peak. NVIDIA's key innovation is architectural:
Decoupled Sensory Streams:
- Vision encoder processes images/video independently
- Audio encoder processes waveforms independently
- Text tokens processed by the LLM backbone
Unlike earlier approaches that interleaved all modalities at the input layer, Omni-Embed keeps them separate until late-stage fusion. This prevents "modality interference"—where the massive number of text tokens (thousands) would dilute the signal from sparse audio features (hundreds).
Training Scale:
- Text-Text: Standard retrieval corpora (HotpotQA, MIRACL, NV-Retriever datasets)
- Text-Image: 1B+ image-text pairs
- Text-Video: Specialized temporal alignment datasets
- Text-Audio: Environmental sounds and speech data
Best Use Cases:
- Any-to-Any retrieval: Find a video frame using an audio query
- Complex RAG: "Find the slide where the speaker discusses quarterly results" (audio → image retrieval)
- Multimodal search engines: Users can search with images, voice, or text interchangeably
Limitations:
- Large model size (4.7B parameters) requires significant compute
- Primarily designed for retrieval, not generation tasks
2. VLM2Vec-V2 (TIGER Lab)
Architecture: Instruction-tuned Qwen2-VL (2B and 7B variants)
VLM2Vec-V2 takes a different approach: transform existing Vision-Language Models into embedding models through instruction tuning.
Key Innovation - Visual Document Training:
The research revealed a surprising finding: training on Visual Documents (PDFs, slides, infographics) significantly improves performance across all vision tasks, not just document understanding.
Why? Visual documents contain:
- Structured spatial layouts (tables, columns, diagrams)
- Fine-grained text-image relationships (captions, labels)
- Hierarchical information organization
These properties force the model to learn precise spatial reasoning that transfers well to general image understanding.
Training Mix Results:
- Image only: 76.3% average score
- Image + Video: 79.8%
- Image + VisDoc: 83.1%
- Image + VisDoc + Video: 85.7% ← Best performance
Video Handling Strategy:
To manage computational costs, VLM2Vec-V2 uses uniform frame sampling—extracting 8 frames per video clip. While sparse, the strong Qwen2-VL backbone can extract semantic meaning effectively.
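Uniform sampling itself is only a few lines. The sketch below picks the center frame of each of 8 equal-length segments; whether VLM2Vec-V2 takes segment centers or segment starts is an assumption here, and `uniform_frame_indices` is a hypothetical helper.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> list:
    """Return evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int((i + 0.5) * segment) for i in range(num_samples)]

# A 10-second clip at 24 fps has 240 frames.
indices = uniform_frame_indices(240)
print(indices)  # [15, 45, 75, 105, 135, 165, 195, 225]
```

At 24 fps the sampled frames are 30 frames (1.25 seconds) apart, which is why fast actions can fall between samples.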
Best Use Cases:
- Document understanding: OCR-free document retrieval and question answering
- Video search: Natural language queries to find specific video segments
- Educational content: Matching lecture slides to video timestamps
Limitations:
- Video processing is sparse (8 frames), so it may miss rapid actions
- Requires instruction engineering for optimal performance
3. Amazon Nova Multimodal Embeddings
Architecture: Proprietary (details not fully disclosed)
Nova is AWS's enterprise-focused embedding model, designed specifically for production RAG systems.
Unified Semantic Space:
Nova's standout feature is its strictly unified latent space. Unlike models where different modalities cluster separately, Nova enforces true unification through its training objective.
"Concept Mixing" Capability:
Because every modality lands in the same space, a single query can combine an image or video with a text modifier:
- Image of a hiking boot + "lighter materials" → Retrieved lightweight boots
- Product photo + "less expensive alternatives" → Retrieved budget options
- Video clip + "similar but shorter" → Retrieved condensed versions
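Nova's internals are not public, so the following is only a sketch of one common way to approximate this behavior in any shared embedding space: blend the normalized image and text vectors, then run an ordinary cosine search. The two-dimensional toy catalog, the `mixed_query` helper, and the 0.5 weight are all assumptions for illustration.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def mixed_query(image_emb: np.ndarray, text_emb: np.ndarray,
                text_weight: float = 0.5) -> np.ndarray:
    """Weighted blend of an image embedding and a text-modifier embedding."""
    return unit((1 - text_weight) * unit(image_emb) + text_weight * unit(text_emb))

def top_k(query: np.ndarray, catalog: np.ndarray, k: int = 1) -> list:
    """Indices of the k catalog items most similar to the query."""
    scores = unit(catalog) @ query
    return np.argsort(-scores)[:k].tolist()

# Toy space: axis 0 = "hiking boot", axis 1 = "lightweight" (hypothetical).
catalog = np.array([
    [1.0, 0.0],  # 0: heavy hiking boot
    [1.0, 1.0],  # 1: lightweight hiking boot
    [0.0, 1.0],  # 2: lightweight sandal
])
boot_photo = np.array([1.0, 0.0])
lighter_materials = np.array([0.0, 1.0])

image_only = top_k(unit(boot_photo), catalog)
mixed = top_k(mixed_query(boot_photo, lighter_materials), catalog)
print(image_only, mixed)  # [0] [1]
```

The image alone retrieves the heavy boot; adding the text modifier shifts the query toward the lightweight boot without drifting all the way to the sandal.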
Best Use Cases:
- E-commerce: Hybrid image+text product search
- Media archives: Content discovery across modalities
- Enterprise knowledge bases: Unified search across documents, videos, presentations
Limitations:
- Proprietary/closed source
- AWS vendor lock-in
- Pricing based on AWS Bedrock usage
Model Selection Framework
Here's a decision framework for choosing the right model:
Do you need audio processing?
├─ YES → Omni-Embed-Nemotron
│        (Best-in-class audio-text alignment)
│
└─ NO → Do you need document understanding?
        ├─ YES → VLM2Vec-V2 or Amazon Nova
        │        (Strong visual document performance)
        │
        └─ NO → What's your deployment constraint?
                ├─ Cloud/High-compute → VLM2Vec-V2 (7B)
                │   (Best overall accuracy)
                │
                ├─ Edge/Mobile → VLM2Vec-V2 (2B)
                │   (Best accuracy-per-parameter)
                │
                └─ Enterprise/AWS → Amazon Nova
                    (Managed service, production-ready)
The Binding Hypothesis: ImageBind vs LanguageBind
The Debate: What Should Be the Universal Anchor?
An alternative to training a single unified model is to "bind" specialized encoders together through a common anchor modality. This leads to a fundamental question: What modality should serve as the universal connector?
ImageBind (Meta AI): Images as the Anchor
Core Hypothesis: Visual information is the most universal sensory modality. Humans learn about the world primarily through vision, and most other modalities (audio, touch, depth) have natural visual correlates.
Architecture:
- Freeze a large pre-trained image encoder (OpenCLIP ViT-H) together with its paired text encoder
- Train separate encoders for Audio, Depth, Thermal, and IMU (Inertial Measurement Unit)
- Align each modality to the frozen image space using contrastive learning
Emergent Alignment:
The powerful property of ImageBind is transitive alignment. If:
- Audio ↔ Image (learned)
- Text ↔ Image (learned)
Then Audio ↔ Text emerges automatically, without ever training on audio-text pairs.
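A toy simulation shows why this works geometrically. Nothing is trained here: the audio and text embeddings are simply modeled as noisy copies of the same image anchors, which is the idealized outcome of aligning each modality to the image space. The noise level and dimensions are arbitrary assumptions.

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_concepts, dim = 5, 64

# Frozen image anchors, one per concept (hypothetical stand-ins).
image_anchor = unit(rng.normal(size=(num_concepts, dim)))

# Each modality was aligned to the image space only, never to the other modality.
audio_emb = unit(image_anchor + 0.05 * rng.normal(size=(num_concepts, dim)))
text_emb = unit(image_anchor + 0.05 * rng.normal(size=(num_concepts, dim)))

# Audio-to-text retrieval still works because both sit near the same anchor.
audio_to_text = (audio_emb @ text_emb.T).argmax(axis=1)
print(audio_to_text.tolist())
```

The catch is that errors compound: audio-text similarity inherits the noise of both alignments, which is one reason binding models trail models trained directly on audio-text pairs.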
Practical Performance:
- ✅ Strong zero-shot classification across modalities
- ✅ Excellent for vision-centric tasks (robotics, navigation)
- ❌ Weaker for purely linguistic reasoning
- ❌ Trails supervised models in fine-grained audio-text retrieval
LanguageBind (PKU): Language as the Anchor
Core Hypothesis: Language is the most semantically rich modality. It contains explicit descriptions, abstract concepts, and reasoning chains that visual information lacks.
Architecture:
- Use a text encoder as the anchor
- Align Video, Audio, Depth, and Infrared (thermal) to text space
- Generate synthetic text descriptions for modalities lacking natural captions
The Synthetic Data Pipeline:
LanguageBind's breakthrough is using generative VLMs to create text descriptions for data lacking natural captions:
Infrared image → VLM → "Thermal signature showing heat dispersion pattern"
Depth map → VLM → "3D surface with peaks in the center and valleys on edges"
This creates the VIDAL-10M dataset (Video-Infrared-Depth-Audio-Language, 10 Million samples).
Performance Advantage:
In zero-shot video-text retrieval benchmarks (MSR-VTT, DiDeMo):
- ImageBind: 34.2% Recall@10
- LanguageBind: 42.7% Recall@10
The language anchor allows more precise semantic queries like "a person hesitating before jumping" (temporal action with intent) compared to image-based anchors that struggle with abstract concepts.
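For readers who want to reproduce this kind of number on their own data, Recall@K is straightforward to compute from a query-by-candidate similarity matrix. The sketch assumes candidate i is the single correct match for query i; the matrix values are invented.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct candidate appears in the top k.

    similarity[i, j] is the score of query i against candidate j,
    and candidate i is assumed to be the correct match for query i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    correct = np.arange(len(similarity))[:, None]
    return float((top_k == correct).any(axis=1).mean())

# Hypothetical text-to-video scores for three queries.
similarity = np.array([
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.2],  # correct video ranked 2nd
    [0.7, 0.6, 0.1],  # correct video ranked 3rd
])

r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
r3 = recall_at_k(similarity, 3)
print(r1, r2, r3)
```

Here one of three queries hits at rank 1, two of three within rank 2, and all three within rank 3.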
The Verdict: Unified Models Win
While both binding approaches are theoretically elegant, empirical results from 2025 show that fully unified models (Omni-Embed, VLM2Vec-V2) outperform binding architectures:
| Approach | Zero-Shot Flexibility | Fine-Grained Accuracy | Training Complexity |
|---|---|---|---|
| ImageBind | ✅ High | ⚠️ Medium | ✅ Lower |
| LanguageBind | ✅ High | ✅ Medium-High | ⚠️ Medium |
| Unified (Omni/VLM2Vec) | ⚠️ Medium | ✅ Very High | ❌ Higher |
Why unified wins:
- Joint optimization: Training all modalities together allows the model to learn the interaction between modalities, not just pairwise alignments
- Task-specific tuning: Unified models can be fine-tuned for specific retrieval tasks with better efficiency
- No information bottleneck: Binding approaches force all information through the anchor modality, potentially losing modality-specific features
Specialized Audio-Text Models
While unified models handle multiple modalities, specialized audio-text models offer higher fidelity for pure audio understanding tasks. The 2025 landscape is defined by the Massive Sound Embedding Benchmark (MSEB), which evaluates audio models across 8 super-tasks.
MSEB: The Audio Embedding Standard
8 Super-Tasks:
- Retrieval: Find audio clips matching a text description
- Reranking: Order candidates by relevance to a query
- Reasoning: Answer questions about audio scenes ("What happened first?")
- Classification: Categorize sounds into classes
- Transcription: Convert speech to text
- Segmentation: Identify temporal boundaries between events
- Clustering: Group similar sounds without labels
- Reconstruction: Generate audio from embeddings
CLAP (Contrastive Language-Audio Pretraining)
Architecture: Dual encoder (for example, HTSAT for audio paired with a BERT-family text encoder such as RoBERTa in LAION-CLAP)
CLAP is the audio analog of CLIP, using contrastive learning to align audio waveforms and text descriptions.
Strengths:
- ✅ Excellent at Classification: Identifying environmental sounds (dog barking, car horn, rain)
- ✅ Good Retrieval performance on simple queries
- ✅ Efficient inference (separate encoders can be cached)
Weaknesses (Revealed by MSEB):
- ❌ Poor Organization: Struggles with clustering similar sounds without supervision
- ❌ Weak Reconstruction: Can't reliably reconstruct audio from embeddings
- ❌ Superficial associations: Often matches spectral features rather than semantic meaning
Example failure case:
- Query: "Heavy rain on a metal roof"
- CLAP retrieves: "Frying bacon in a pan"
- Why? Both have similar high-frequency crackling patterns
CLAP lacks the "reasoning" to distinguish that one is weather (outdoor, sustained) while the other is cooking (indoor, active process).
AudioCLIP
Architecture: Tri-modal (Image-Text-Audio) extension of CLIP
AudioCLIP adds an audio head (ESResNeXt CNN) to a pre-trained CLIP model, achieving strong performance on Environmental Sound Classification (ESC-50: 97.15% accuracy).
Advantages over pure CLAP:
- The image modality provides additional grounding for ambiguous sounds
- Can handle queries like "Find audio similar to this image" (e.g., finding thunderstorm sounds from a storm photo)
Limitations:
- Less effective at processing linguistic content in speech compared to newer models
- The image encoder sometimes introduces noise when audio-text pairs are sufficient
The Future: Unified Audio in LLMs
The trajectory is clear: specialized audio models are being absorbed into unified LLMs. Models like Qwen-Audio can process speech directly as tokens, allowing:
- True understanding of linguistic content in audio
- Reasoning about audio scenes using the LLM's world knowledge
- Joint text-audio generation tasks
Choosing the Right Model for Your Use Case
Decision Matrix
| Your Use Case | Recommended Model | Key Considerations |
|---|---|---|
| Multimodal RAG (documents, images, videos) | VLM2Vec-V2 (7B) or Amazon Nova | Need instruction tuning for diverse queries |
| E-commerce search (image + text filters) | Amazon Nova or Omni-Embed | Concept mixing capabilities crucial |
| Audio search (podcasts, sound effects) | Omni-Embed-Nemotron | Best audio-text alignment available |
| Video understanding (action recognition) | VLM2Vec-V2 or InternVideo2 | Temporal reasoning required |
| Document retrieval (PDFs, slides) | VLM2Vec-V2 | Visual document training data advantage |
| Low-latency edge deployment | VLM2Vec-V2 (2B) | Best accuracy-per-parameter ratio |
| Zero-shot classification | ImageBind or LanguageBind | No fine-tuning required |
| Environmental sound classification | CLAP or AudioCLIP | Specialized for ESC tasks |
Practical Implementation Considerations
1. Model Size vs. Performance Trade-off:
# Inference time comparison (A100 GPU)
VLM2Vec-V2 (2B): ~50ms per batch (32 images)
VLM2Vec-V2 (7B): ~180ms per batch (32 images)
Omni-Embed (4.7B): ~120ms per batch (32 images + audio)
# Accuracy gain
VLM2Vec 2B → 7B: +5.8% on MMEB-V2
Cost: 3.6x slower inference
2. Instruction Engineering:
For instruction-tuned models, the quality of your instruction prompt significantly impacts results:
Bad instruction:
"Find similar items"
Good instruction:
"Retrieve products that match the visual style, color scheme, and design aesthetic of the input image, prioritizing items from the same category"
3. Fine-tuning vs. Prompting:
- Use prompting when: You have diverse tasks, limited labeled data, need rapid iteration
- Use fine-tuning when: You have 10K+ labeled examples, single specialized task, can afford training time
4. Hybrid Search Strategies:
Don't rely solely on embeddings. Combine with:
- Metadata filtering: Filter by date, category, author before vector search
- Keyword search: BM25 for exact matches, embeddings for semantic similarity
- Reranking: Use a larger model to rerank top-K results from a faster model
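One simple way to combine a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). It is not specific to any model discussed here; I'm using it as a stand-in because it needs only ranks, not comparable scores. The document IDs are invented, and k=60 is the conventional default.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists; each item scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a BM25 index and from an embedding index.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers agree on rise to the top, and the fused list can then be handed to a reranker.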
Cost Analysis
Training costs (estimated on cloud infrastructure):
| Model | Training Data | GPU Hours | Estimated Cost |
|---|---|---|---|
| CLIP-style dual encoder | 400M pairs | ~5K hours (A100) | $15K-$25K |
| VLM2Vec-V2 (2B) fine-tuning | 50M instruction pairs | ~2K hours | $6K-$10K |
| Omni-Embed from scratch | 1B+ mixed pairs | ~20K hours | $60K-$100K |
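The cost column is just GPU hours multiplied by an hourly rate. The figures in the table are consistent with roughly $3 to $5 per A100-hour; that rate is back-derived from the table rather than a quoted price, so treat it as an assumption and substitute your own.

```python
def cost_range(gpu_hours: float, low_rate: float = 3.0, high_rate: float = 5.0) -> tuple:
    """Return (low, high) cost in dollars for a given number of GPU hours."""
    return gpu_hours * low_rate, gpu_hours * high_rate

# GPU hours taken from the table above.
gpu_hours_by_model = {
    "CLIP-style dual encoder": 5_000,
    "VLM2Vec-V2 (2B) fine-tuning": 2_000,
    "Omni-Embed from scratch": 20_000,
}

estimates = {name: cost_range(hours) for name, hours in gpu_hours_by_model.items()}
for name, (low, high) in estimates.items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```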
Inference costs (per 1M embeddings):
- VLM2Vec-V2 (2B) on A100: ~$2-3
- VLM2Vec-V2 (7B) on A100: ~$8-12
- Amazon Nova (Bedrock): ~$15-25 (managed service premium)
The Road Ahead
The evolution from CLIP to unified foundation models represents more than architectural progress—it's a fundamental shift in how AI systems understand the world. The key takeaways:
- Instruction tuning is the new paradigm: Static embeddings are being replaced by dynamic, promptable representations
- Unified models outperform binding: Joint training across modalities yields better results than pairwise alignment
- Visual documents are underrated: Training on PDFs and slides improves general vision understanding
- Specialization still matters: For pure audio or video tasks, specialized models can still offer advantages
In the next post, we'll dive deep into video embeddings—exploring how models handle the fourth dimension (time) and the specific challenges of temporal reasoning. We'll examine VideoMAE V2's dual masking strategy and why simple frame averaging fails for action recognition.
Get in Touch
Need help implementing multimodal embeddings in your AI applications? Want to discuss RAG architecture strategies or custom model fine-tuning?
Connect with me:
- 🐦 Twitter/X: @TheDataGuyPro
- 💼 LinkedIn: Muhammad Afzaal
- 💻 GitHub: @mafzaal
- 🎥 YouTube: @TheDataGuyPro
- 🎧 Podcast: TheDataGuy Show
Whether you're looking for consulting services, training, or just want to discuss multimodal AI architecture, I'd love to hear from you!
References
Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. Y. (2022). Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. arXiv preprint arXiv:2203.02053.
Meng, R., et al. (2025). VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents. arXiv preprint arXiv:2507.04590.
NVIDIA. (2024). Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video. Hugging Face Model Hub.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
Ghosh, S., et al. (2025). Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. arXiv preprint arXiv:2503.03983.
Girdhar, R., et al. (2023). ImageBind: One Embedding Space To Bind Them All. CVPR 2023.
Zhu, L., et al. (2023). LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. ICLR 2024.
Amazon Web Services. (2024). Amazon Nova Multimodal Embeddings: State-of-the-art embedding model for agentic RAG and semantic search. AWS News Blog.
Deshmukh, S., et al. (2024). From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence. Google Research Blog.
Guzhov, A., et al. (2021). AudioCLIP: Extending CLIP to Image, Text and Audio. arXiv preprint arXiv:2106.13043.
TIGER Lab. (2024). MMEB-V2: Massive Multimodal Embedding Benchmark. GitHub Repository.
Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV 2023.