Understanding Multimodal Embeddings: The Evolution from CLIP to Unified Foundation Models

Artificial intelligence is changing how machines understand and connect different types of data: text, images, audio, and video. Between 2023 and 2025, we've seen a shift from specialized, isolated embedding models to unified foundation models that process and align multiple modalities in a single system. This is more than an incremental improvement; it changes how AI systems perceive and reason about sensory data.

If you're building RAG (Retrieval-Augmented Generation) systems, multimodal search engines, or any application that needs to connect different types of media, understanding this landscape is crucial. The stakes are high: the difference between a CLIP-style dual encoder and a modern instruction-tuned model like Omni-Embed can mean the difference between 60% and 95% accuracy in cross-modal retrieval tasks.

The CLIP Era: Contrastive Learning and Its Limitations

How CLIP Changed Everything

When OpenAI introduced CLIP (Contrastive Language-Image Pre-training) in 2021, it revolutionized how we think about multimodal AI. The core innovation was deceptively simple: train two neural networks (one for images, one for text) to map semantically similar content to nearby points in a shared vector space.

What CLIP does well:

  • Zero-shot classification: It can classify images into categories it's never explicitly trained on by comparing image embeddings to text descriptions
  • Semantic search: Given a text query, it can find relevant images from a database
  • Transfer learning: CLIP embeddings work remarkably well as features for downstream tasks
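
To make the zero-shot classification idea concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are hypothetical stand-ins for real encoder outputs (a real CLIP model would produce hundreds of dimensions), and the helper names are illustrative rather than part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Hypothetical toy embeddings; in practice these come from the text encoder.
label_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding; in practice this comes from the image encoder.
image_emb = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(image_emb, label_embs)
print(predicted)
```

No classifier head is trained here: adding a new class only requires embedding one more text description.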

Where CLIP falls short:

  1. Coarse-grained understanding: CLIP excels at recognizing "a dog" but struggles with fine-grained distinctions like "a golden retriever puppy sitting on a blue couch"
  2. Static embeddings: The same image always produces the same embedding, regardless of what aspect you're interested in
  3. Information density mismatch: A 30-second audio clip contains far more information than a 5-word caption, but CLIP-style models treat them as equivalent
  4. The modality gap: Even when trained together, embeddings from different modalities cluster in distinct regions of the vector space

The Modality Gap Problem

Understanding the Geometric Challenge

The "modality gap" is one of the most persistent challenges in multimodal AI. Even in well-trained CLIP models, if you visualize the embedding space using techniques like t-SNE or UMAP, you'll observe that text embeddings occupy a cone-shaped region that's geometrically separate from image embeddings.
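
One simple way to put a number on the gap, in the spirit of the centroid-difference measure used by Liang et al. (2022), is the distance between the mean normalized image embedding and the mean normalized text embedding. The sketch below assumes toy 3-dimensional vectors; the function names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(embs_a, embs_b):
    """Euclidean distance between the centroids of two normalized embedding sets."""
    c_a = centroid([l2_normalize(v) for v in embs_a])
    c_b = centroid([l2_normalize(v) for v in embs_b])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c_a, c_b)))

# Hypothetical embeddings: images cluster along one axis, text along another.
image_embs = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1]]
text_embs = [[0.1, 0.2, 1.0], [0.0, 0.3, 0.9]]

gap = modality_gap(image_embs, text_embs)
same = modality_gap(image_embs, image_embs)
print(f"image-text gap: {gap:.3f}, image-image gap: {same:.3f}")
```

A gap near zero would mean the two modalities share a region of the space; a large value indicates the separation described above.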

Why does this happen?

  1. Discrete vs. Continuous: Text is inherently discrete (tokens), while images and audio are continuous signals (pixels, waveforms). This fundamental difference creates distinct statistical properties in their embeddings.

  2. Training objective limitations: Contrastive learning only enforces that matched pairs are closer than unmatched pairs. It doesn't explicitly force the embeddings to occupy the same region of space.

  3. Information density: Different modalities encode information at different rates. A single word can require dozens of pixels to represent visually, creating asymmetries in the embedding process.

Real-world impact:

When you search for "sunset over mountains" in a CLIP-based image database, the model needs to:

  1. Project your text query into text embedding space
  2. Cross the modality gap
  3. Find similar image embeddings

Crossing the gap introduces potential errors. Because raw scores are distorted by where each embedding sits relative to the gap, a text embedding with cosine similarity 0.85 to an image might actually be a better match than one with 0.90.

How Modern Models Address the Gap

2025's unified models tackle this in several ways:

1. Decoupled Modality Streams (Omni-Embed approach):

  • Process vision and audio through separate encoders
  • Prevent the dominant modality (usually text) from overwhelming others
  • Late-stage fusion in the LLM backbone preserves modality-specific features

2. Instruction Tuning (VLM2Vec approach):

  • Embeddings become conditional on both input and instruction
  • The model learns to emphasize different features based on the query context
  • Reduces reliance on fixed geometric relationships

3. Massive Scale Training:

  • Training on billions of aligned pairs helps close the gap through sheer exposure
  • Models learn the subtle transformations needed to bridge modalities

Instruction-Tuned Unified Transformers: The 2025 Breakthrough

The Paradigm Shift

The defining innovation of 2025 is promptable embeddings. Instead of generating a fixed vector for an input, modern models produce embeddings conditional on both the content and an instruction, allowing a single model to generate task-specific representations.

Why this matters:

Consider analyzing a product photo for an e-commerce site. Depending on your goal, you might want to:

  • Find similar products (focus on object category and attributes)
  • Find products in similar settings (focus on background and context)
  • Find products with similar lighting (focus on illumination and mood)

With traditional CLIP, you get one embedding that tries to capture everything. With instruction-tuned models, you get a different embedding for each use case—all from the same image.

Implementation: How It Works

Modern instruction-tuned models like VLM2Vec-V2 work by:

  1. Appending a special pooling token to the input sequence (e.g., <|endoftext|> or <embed>)
  2. Processing the full sequence through a Vision-Language Model (VLM) backbone
  3. Extracting the pooling token's hidden state as the embedding
  4. Training with contrastive loss on instruction-paired examples

The key insight: the LLM's "thought vector" at the pooling token position naturally incorporates both the visual/textual content and the instruction's intent.
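
The extraction step can be sketched as last-token pooling over the backbone's hidden states. This is a simplified illustration, not the VLM2Vec-V2 code: the hidden states are hypothetical toy values, and the attention mask convention (1 for real tokens, 0 for padding) is an assumption borrowed from common transformer tooling.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state at the last non-padded position."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical backbone output: 4 positions, hidden size 3.
hidden_states = [
    [0.1, 0.2, 0.3],  # instruction token
    [0.4, 0.1, 0.0],  # content token
    [3.0, 4.0, 0.0],  # pooling token appended to the sequence
    [9.9, 9.9, 9.9],  # padding, must be ignored
]
attention_mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
print(embedding)
```

Because the pooling token comes last, its hidden state has attended to both the instruction and the content, which is why changing the instruction changes the embedding.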

Training data structure:

{
    "instruction": "Retrieve the video that matches this description",
    "query": "A chef preparing pasta in a modern kitchen",
    "positive": <video_embedding>,
    "negatives": [<video_embedding_1>, <video_embedding_2>, ...]
}
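
Given one record of this shape, the contrastive objective can be sketched as an InfoNCE loss: the query should score higher against the positive than against every negative. The temperature of 0.05 and the 2-dimensional vectors are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(query, positive, negatives, temperature=0.05):
    """Negative log-probability of the positive under a softmax over all candidates."""
    sims = [cosine(query, positive)] + [cosine(query, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[0]

# Hypothetical embeddings for an instruction+query and its candidates.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

good_loss = info_nce(query, positive, negatives)
# Swap roles so the "positive" is far from the query: the loss should rise.
bad_loss = info_nce(query, negatives[0], [positive, negatives[1]])
print(f"aligned: {good_loss:.4f}, misaligned: {bad_loss:.4f}")
```

The loss only rewards the positive for outranking the negatives, which is why harder negatives tend to give a stronger training signal.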

Performance Impact

The MMEB-V2 (Massive Multimodal Embedding Benchmark V2) results show dramatic improvements:

| Model Type | Image-Text Retrieval | Video Retrieval | Document Understanding |
|---|---|---|---|
| CLIP-ViT-L | 72.3% | 45.8% | 38.2% |
| ImageBind | 68.7% | 51.3% | 42.1% |
| VLM2Vec-V2 (2B) | 82.4% | 67.9% | 71.3% |
| VLM2Vec-V2 (7B) | 87.2% | 73.6% | 78.9% |

The gap is especially pronounced for video retrieval, where the temporal dimension requires genuine reasoning—something that static embeddings struggle with.

State-of-the-Art Models Comparison

Unified Foundation Models

1. Omni-Embed-Nemotron (NVIDIA)

Architecture: Based on Qwen2.5-Omni (4.7B parameters)

Omni-Embed represents the "all-in-one" philosophy at its peak. NVIDIA's key innovation is architectural:

Decoupled Sensory Streams:

  • Vision encoder processes images/video independently
  • Audio encoder processes waveforms independently
  • Text tokens processed by the LLM backbone

Unlike earlier approaches that interleaved all modalities at the input layer, Omni-Embed keeps them separate until late-stage fusion. This prevents "modality interference"—where the massive number of text tokens (thousands) would dilute the signal from sparse audio features (hundreds).

Training Scale:

  • Text-Text: Standard retrieval corpora (HotpotQA, MIRACL, NV-Retriever datasets)
  • Text-Image: 1B+ image-text pairs
  • Text-Video: Specialized temporal alignment datasets
  • Text-Audio: Environmental sounds and speech data

Best Use Cases:

  • Any-to-Any retrieval: Find a video frame using an audio query
  • Complex RAG: "Find the slide where the speaker discusses quarterly results" (audio → image retrieval)
  • Multimodal search engines: Users can search with images, voice, or text interchangeably

Limitations:

  • Large model size (4.7B parameters) requires significant compute
  • Primarily designed for retrieval, not generation tasks

2. VLM2Vec-V2 (TIGER Lab)

Architecture: Instruction-tuned Qwen2-VL (2B and 7B variants)

VLM2Vec-V2 takes a different approach: transform existing Vision-Language Models into embedding models through instruction tuning.

Key Innovation - Visual Document Training:

The research revealed a surprising finding: training on Visual Documents (PDFs, slides, infographics) significantly improves performance across all vision tasks, not just document understanding.

Why? Visual documents contain:

  • Structured spatial layouts (tables, columns, diagrams)
  • Fine-grained text-image relationships (captions, labels)
  • Hierarchical information organization

These properties force the model to learn precise spatial reasoning that transfers well to general image understanding.

Training Mix Results:

  • Image only: 76.3% average score
  • Image + Video: 79.8%
  • Image + VisDoc: 83.1%
  • Image + VisDoc + Video: 85.7% ← Best performance

Video Handling Strategy:

To manage computational costs, VLM2Vec-V2 uses uniform frame sampling—extracting 8 frames per video clip. While sparse, the strong Qwen2-VL backbone can extract semantic meaning effectively.
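
Uniform sampling can be sketched as picking one frame from each of 8 equal-length segments. Whether the actual implementation takes the segment midpoint, the segment start, or something else is an assumption here; the function name is illustrative.

```python
def uniform_frame_indices(total_frames, num_samples=8):
    """Pick the middle frame of each of num_samples equal-length segments."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        # Short clip: use every frame rather than repeating indices.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

# A 10-second clip at 24 fps has 240 frames.
indices = uniform_frame_indices(240)
print(indices)
```

With 240 frames the sampled frames are 30 frames (1.25 seconds) apart, which illustrates why actions shorter than that spacing can fall between samples.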

Best Use Cases:

  • Document understanding: OCR-free document retrieval and question answering
  • Video search: Natural language queries to find specific video segments
  • Educational content: Matching lecture slides to video timestamps

Limitations:

  • Video processing is sparse (8 frames) - may miss rapid actions
  • Requires instruction engineering for optimal performance

3. Amazon Nova Multimodal Embeddings

Architecture: Proprietary (details not fully disclosed)

Nova is AWS's enterprise-focused embedding model, designed specifically for production RAG systems.

Unified Semantic Space:

Nova's standout feature is its strictly unified latent space. Unlike models where different modalities cluster separately, Nova enforces true unification through its training objective.

"Concept Mixing" Capability:

This enables queries like:

  • Image of a hiking boot + "lighter materials" → Retrieved lightweight boots
  • Product photo + "less expensive alternatives" → Retrieved budget options
  • Video clip + "similar but shorter" → Retrieved condensed versions
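
Nova's internals are not public, so the following is only a generic illustration of how image-plus-text queries can work in any shared embedding space: blend the two normalized vectors and search with the result. The weighting scheme, catalog, and vectors are all hypothetical and do not represent the Bedrock API.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity via normalized dot product."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def mix_query(image_emb, text_emb, image_weight=0.7):
    """Weighted blend of a normalized image embedding and a normalized text embedding."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    return l2_normalize([image_weight * i + (1.0 - image_weight) * t
                         for i, t in zip(img, txt)])

def search(query, catalog):
    """Return the catalog item most similar to the query."""
    return max(catalog, key=lambda name: cosine(query, catalog[name]))

# Toy 2-d space: axis 0 ~ "boot-like", axis 1 ~ "lightweight".
catalog = {
    "heavy leather boot": [1.0, 0.05],
    "lightweight trail boot": [0.8, 0.6],
    "lightweight sandal": [0.1, 1.0],
}
boot_photo = [1.0, 0.0]        # hypothetical image embedding
lighter_text = [0.0, 1.0]      # hypothetical embedding of "lighter materials"

image_only = search(boot_photo, catalog)
mixed = search(mix_query(boot_photo, lighter_text), catalog)
print(image_only, "->", mixed)
```

This kind of vector arithmetic only behaves sensibly when both modalities occupy the same region of the space, which is why a unified latent space matters for such queries.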

Best Use Cases:

  • E-commerce: Hybrid image+text product search
  • Media archives: Content discovery across modalities
  • Enterprise knowledge bases: Unified search across documents, videos, presentations

Limitations:

  • Proprietary/closed source
  • AWS vendor lock-in
  • Pricing based on AWS Bedrock usage

Model Selection Framework

Here's a decision framework for choosing the right model:

Do you need audio processing?
├─ YES → Omni-Embed-Nemotron
│         (Best-in-class audio-text alignment)

└─ NO → Do you need document understanding?
         ├─ YES → VLM2Vec-V2 or Amazon Nova
         │         (Strong visual document performance)

         └─ NO → What's your deployment constraint?
                  ├─ Cloud/High-compute → VLM2Vec-V2 (7B)
                  │                       (Best overall accuracy)

                  ├─ Edge/Mobile → VLM2Vec-V2 (2B)
                  │                 (Best accuracy-per-parameter)

                  └─ Enterprise/AWS → Amazon Nova
                                      (Managed service, production-ready)
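
The same decision tree, written as a small function so it can be dropped into a planning script. The function name and the deployment labels ("cloud", "edge", "aws") are illustrative choices, not an established convention.

```python
def recommend_model(needs_audio, needs_documents, deployment):
    """Walk the decision framework above and return the suggested model."""
    if needs_audio:
        return "Omni-Embed-Nemotron"
    if needs_documents:
        return "VLM2Vec-V2 or Amazon Nova"
    by_deployment = {
        "cloud": "VLM2Vec-V2 (7B)",  # best overall accuracy
        "edge": "VLM2Vec-V2 (2B)",   # best accuracy-per-parameter
        "aws": "Amazon Nova",        # managed service
    }
    if deployment not in by_deployment:
        raise ValueError(f"unknown deployment constraint: {deployment!r}")
    return by_deployment[deployment]

print(recommend_model(needs_audio=False, needs_documents=False, deployment="edge"))
```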

The Binding Hypothesis: ImageBind vs LanguageBind

The Debate: What Should Be the Universal Anchor?

An alternative to training a single unified model is to "bind" specialized encoders together through a common anchor modality. This leads to a fundamental question: What modality should serve as the universal connector?

ImageBind (Meta AI): Images as the Anchor

Core Hypothesis: Visual information is the most universal sensory modality. Humans learn about the world primarily through vision, and most other modalities (audio, touch, depth) have natural visual correlates.

Architecture:

  1. Freeze a large pre-trained image encoder (the paper uses an OpenCLIP ViT-H) together with its paired text encoder
  2. Train separate encoders for Audio, Depth, Thermal, and IMU (Inertial Measurement Unit)
  3. Align each modality to the frozen image space using contrastive learning

Emergent Alignment:

The powerful property of ImageBind is transitive alignment. If:

  • Audio ↔ Image (learned)
  • Text ↔ Image (learned)

Then Audio ↔ Text emerges automatically, without ever training on audio-text pairs.
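
A toy simulation shows why this works. Each concept gets an image anchor; audio and text embeddings are each placed near their image anchor independently, and audio-to-text retrieval is then tested without any audio-text pairing step. The noise scale, dimensionality, and one-hot anchors are assumptions chosen to keep the example small.

```python
import math
import random

random.seed(0)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def align_to(anchor, noise=0.1):
    """Simulate an encoder trained to land near the image anchor, with small error."""
    return [x + random.uniform(-noise, noise) for x in anchor]

# Four concepts, each with a one-hot image anchor in a 4-d space.
image = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Audio and text are each aligned to images only, never to each other.
audio = [align_to(v) for v in image]
text = [align_to(v) for v in image]

# For every audio clip, retrieve the nearest text embedding.
matches = [max(range(len(text)), key=lambda j: cosine(a, text[j])) for a in audio]
print(matches)
```

Every audio clip retrieves the text for its own concept, because both were pulled toward the same image anchor.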

Practical Performance:

  • ✅ Strong zero-shot classification across modalities
  • ✅ Excellent for vision-centric tasks (robotics, navigation)
  • ❌ Weaker for purely linguistic reasoning
  • ❌ Trails supervised models in fine-grained audio-text retrieval

LanguageBind (PKU): Language as the Anchor

Core Hypothesis: Language is the most semantically rich modality. It contains explicit descriptions, abstract concepts, and reasoning chains that visual information lacks.

Architecture:

  1. Use a text encoder as the anchor
  2. Align Video, Audio, Depth, Infrared, and 3D point clouds to text space
  3. Generate synthetic text descriptions for modalities lacking natural captions

The Synthetic Data Pipeline:

LanguageBind's breakthrough is using generative VLMs to create text descriptions for data lacking natural captions:

Infrared image → VLM → "Thermal signature showing heat dispersion pattern"
Depth map → VLM → "3D surface with peaks in the center and valleys on edges"

This creates the VIDAL-10M dataset (Video-Infrared-Depth-Audio-Language, 10 Million samples).

Performance Advantage:

In zero-shot video-text retrieval benchmarks (MSR-VTT, DiDeMo):

  • ImageBind: 34.2% Recall@10
  • LanguageBind: 42.7% Recall@10

The language anchor allows more precise semantic queries like "a person hesitating before jumping" (temporal action with intent) compared to image-based anchors that struggle with abstract concepts.
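
For readers unfamiliar with the metric, Recall@K is the fraction of queries whose correct item appears among the top K retrieved results. A minimal sketch, with made-up video IDs:

```python
def recall_at_k(results, k):
    """Fraction of queries whose relevant item is in the top-k ranked list.

    results: list of (ranked_ids, relevant_id) pairs, one per query.
    """
    hits = sum(1 for ranked_ids, relevant_id in results if relevant_id in ranked_ids[:k])
    return hits / len(results)

# Hypothetical retrieval output for three text queries.
results = [
    (["v3", "v1", "v7"], "v1"),  # correct video ranked 2nd
    (["v2", "v9", "v4"], "v4"),  # correct video ranked 3rd
    (["v5", "v6", "v8"], "v0"),  # correct video not retrieved
]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(results, k):.3f}")
```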

The Verdict: Unified Models Win

While both binding approaches are theoretically elegant, empirical results from 2025 show that fully unified models (Omni-Embed, VLM2Vec-V2) outperform binding architectures:

| Approach | Zero-Shot Flexibility | Fine-Grained Accuracy | Training Complexity |
|---|---|---|---|
| ImageBind | ✅ High | ❌ Medium | ✅ Lower |
| LanguageBind | ✅ High | ✅ Medium-High | ⚠️ Medium |
| Unified (Omni/VLM2Vec) | ⚠️ Medium | ✅ Very High | ❌ Higher |

Why unified wins:

  1. Joint optimization: Training all modalities together allows the model to learn the interaction between modalities, not just pairwise alignments
  2. Task-specific tuning: Unified models can be fine-tuned for specific retrieval tasks with better efficiency
  3. No information bottleneck: Binding approaches force all information through the anchor modality, potentially losing modality-specific features

Specialized Audio-Text Models

While unified models handle multiple modalities, specialized audio-text models offer higher fidelity for pure audio understanding tasks. The 2025 landscape is defined by the Massive Sound Embedding Benchmark (MSEB), which evaluates audio models across 8 super-tasks.

MSEB: The Audio Embedding Standard

8 Super-Tasks:

  1. Retrieval: Find audio clips matching a text description
  2. Reranking: Order candidates by relevance to a query
  3. Reasoning: Answer questions about audio scenes ("What happened first?")
  4. Classification: Categorize sounds into classes
  5. Transcription: Convert speech to text
  6. Segmentation: Identify temporal boundaries between events
  7. Clustering: Group similar sounds without labels
  8. Reconstruction: Generate audio from embeddings

CLAP (Contrastive Language-Audio Pretraining)

Architecture: Dual encoder (HTSAT for audio, BERT for text)

CLAP is the audio analog of CLIP, using contrastive learning to align audio waveforms and text descriptions.

Strengths:

  • ✅ Excellent at Classification: Identifying environmental sounds (dog barking, car horn, rain)
  • ✅ Good Retrieval performance on simple queries
  • ✅ Efficient inference (separate encoders can be cached)

Weaknesses (Revealed by MSEB):

  • Poor Organization: Struggles with clustering similar sounds without supervision
  • Weak Reconstruction: Can't reliably reconstruct audio from embeddings
  • Superficial associations: Often matches spectral features rather than semantic meaning

Example failure case:

  • Query: "Heavy rain on a metal roof"
  • CLAP retrieves: "Frying bacon in a pan"
  • Why? Both have similar high-frequency crackling patterns

CLAP lacks the "reasoning" to distinguish that one is weather (outdoor, sustained) while the other is cooking (indoor, active process).

AudioCLIP

Architecture: Tri-modal (Image-Text-Audio) extension of CLIP

AudioCLIP adds an audio head (ESResNeXt CNN) to a pre-trained CLIP model, achieving strong performance on Environmental Sound Classification (ESC-50: 96.2% accuracy).

Advantages over pure CLAP:

  • The image modality provides additional grounding for ambiguous sounds
  • Can handle queries like "Find audio similar to this image" (e.g., finding thunderstorm sounds from a storm photo)

Limitations:

  • Less effective at processing linguistic content in speech compared to newer models
  • The image encoder sometimes introduces noise when audio-text pairs are sufficient

The Future: Unified Audio in LLMs

The trajectory is clear: specialized audio models are being absorbed into unified LLMs. Models like Qwen-Audio can process speech directly as tokens, allowing:

  • True understanding of linguistic content in audio
  • Reasoning about audio scenes using the LLM's world knowledge
  • Joint text-audio generation tasks

Choosing the Right Model for Your Use Case

Decision Matrix

| Your Use Case | Recommended Model | Key Considerations |
|---|---|---|
| Multimodal RAG (documents, images, videos) | VLM2Vec-V2 (7B) or Amazon Nova | Need instruction tuning for diverse queries |
| E-commerce search (image + text filters) | Amazon Nova or Omni-Embed | Concept mixing capabilities crucial |
| Audio search (podcasts, sound effects) | Omni-Embed-Nemotron | Best audio-text alignment available |
| Video understanding (action recognition) | VLM2Vec-V2 or InternVideo2 | Temporal reasoning required |
| Document retrieval (PDFs, slides) | VLM2Vec-V2 | Visual document training data advantage |
| Low-latency edge deployment | VLM2Vec-V2 (2B) | Best accuracy-per-parameter ratio |
| Zero-shot classification | ImageBind or LanguageBind | No fine-tuning required |
| Environmental sound classification | CLAP or AudioCLIP | Specialized for ESC tasks |

Practical Implementation Considerations

1. Model Size vs. Performance Trade-off:

# Inference time comparison (A100 GPU)
VLM2Vec-V2 (2B):  ~50ms per batch (32 images)
VLM2Vec-V2 (7B):  ~180ms per batch (32 images)
Omni-Embed (4.7B): ~120ms per batch (32 images + audio)

# Accuracy gain
VLM2Vec 2B → 7B: +5.8% on MMEB-V2
Cost: 3.6x slower inference
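
The batch latencies above translate into throughput as follows. The latency figures are the ones quoted in this post; the script only does the arithmetic.

```python
BATCH_SIZE = 32

# Batch latencies in milliseconds, as quoted above (A100 GPU).
latency_ms = {
    "VLM2Vec-V2 (2B)": 50,
    "VLM2Vec-V2 (7B)": 180,
    "Omni-Embed (4.7B)": 120,
}

def items_per_second(batch_latency_ms, batch_size=BATCH_SIZE):
    """Convert a per-batch latency into items processed per second."""
    return batch_size * 1000 / batch_latency_ms

throughput = {name: items_per_second(ms) for name, ms in latency_ms.items()}
slowdown = latency_ms["VLM2Vec-V2 (7B)"] / latency_ms["VLM2Vec-V2 (2B)"]

for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} items/s")
print(f"7B vs 2B slowdown: {slowdown:.1f}x")
```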

2. Instruction Engineering:

For instruction-tuned models, the quality of your instruction prompt significantly impacts results:

Bad instruction:

"Find similar items"

Good instruction:

"Retrieve products that match the visual style, color scheme, and design aesthetic of the input image, prioritizing items from the same category"

3. Fine-tuning vs. Prompting:

  • Use prompting when: You have diverse tasks, limited labeled data, need rapid iteration
  • Use fine-tuning when: You have 10K+ labeled examples, single specialized task, can afford training time

4. Hybrid Search Strategies:

Don't rely solely on embeddings. Combine with:

  • Metadata filtering: Filter by date, category, author before vector search
  • Keyword search: BM25 for exact matches, embeddings for semantic similarity
  • Reranking: Use a larger model to rerank top-K results from a faster model
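
One common way to combine these signals is metadata filtering followed by reciprocal rank fusion (RRF) of the keyword and embedding rankings. This is a sketch of that particular technique, not the only option; the document IDs, rankings, and the constant k=60 (a conventional default for RRF) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical corpus metadata and two independent rankings for one query.
docs = {
    "a": {"category": "slides"},
    "b": {"category": "slides"},
    "c": {"category": "video"},
    "d": {"category": "slides"},
}
bm25_ranking = ["b", "a", "c", "d"]       # keyword search order
embedding_ranking = ["a", "d", "b", "c"]  # vector search order

# Step 1: metadata filter. Step 2: fuse the filtered rankings.
allowed = {doc_id for doc_id, meta in docs.items() if meta["category"] == "slides"}
filtered = [[d for d in ranking if d in allowed]
            for ranking in (bm25_ranking, embedding_ranking)]
fused = reciprocal_rank_fusion(filtered)
print(fused)
```

The fused top-K list is what you would then hand to a larger reranking model.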

Cost Analysis

Training costs (estimated on cloud infrastructure):

| Model | Training Data | GPU Hours | Estimated Cost |
|---|---|---|---|
| CLIP-style dual encoder | 400M pairs | ~5K hours (A100) | $15K-$25K |
| VLM2Vec-V2 (2B) fine-tuning | 50M instruction pairs | ~2K hours | $6K-$10K |
| Omni-Embed from scratch | 1B+ mixed pairs | ~20K hours | $60K-$100K |

Inference costs (per 1M embeddings):

  • VLM2Vec-V2 (2B) on A100: ~$2-3
  • VLM2Vec-V2 (7B) on A100: ~$8-12
  • Amazon Nova (Bedrock): ~$15-25 (managed service premium)
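
To turn these per-million figures into a budget, multiply by corpus size. The sketch below uses the estimates quoted above and a hypothetical corpus of 50 million items.

```python
# Estimated cost range in USD per 1M embeddings, as quoted above.
COST_PER_MILLION_USD = {
    "VLM2Vec-V2 (2B) on A100": (2, 3),
    "VLM2Vec-V2 (7B) on A100": (8, 12),
    "Amazon Nova (Bedrock)": (15, 25),
}

def estimate_cost(num_embeddings, cost_range):
    """Scale a (low, high) per-million cost range to a given corpus size."""
    millions = num_embeddings / 1_000_000
    low, high = cost_range
    return (low * millions, high * millions)

corpus_size = 50_000_000  # hypothetical corpus
estimates = {name: estimate_cost(corpus_size, rng)
             for name, rng in COST_PER_MILLION_USD.items()}

for name, (low, high) in estimates.items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```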

The Road Ahead

The evolution from CLIP to unified foundation models represents more than architectural progress—it's a fundamental shift in how AI systems understand the world. The key takeaways:

  1. Instruction tuning is the new paradigm: Static embeddings are being replaced by dynamic, promptable representations
  2. Unified models outperform binding: Joint training across modalities yields better results than pairwise alignment
  3. Visual documents are underrated: Training on PDFs and slides improves general vision understanding
  4. Specialization still matters: For pure audio or video tasks, specialized models can still offer advantages

In the next post, we'll dive deep into video embeddings—exploring how models handle the fourth dimension (time) and the specific challenges of temporal reasoning. We'll examine VideoMAE V2's dual masking strategy and why simple frame averaging fails for action recognition.

Get in Touch

Need help implementing multimodal embeddings in your AI applications? Want to discuss RAG architecture strategies or custom model fine-tuning?

Whether you're looking for consulting services, training, or just want to discuss multimodal AI architecture, I'd love to hear from you!

References

  1. Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. Y. (2022). Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. arXiv preprint arXiv:2203.02053.

  2. Wu, J., et al. (2025). VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents. arXiv preprint arXiv:2507.04590.

  3. NVIDIA. (2024). Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video. Hugging Face Model Hub.

  4. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.

  5. Deshmukh, S., et al. (2025). Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. arXiv preprint arXiv:2503.03983.

  6. Girdhar, R., et al. (2023). ImageBind: One Embedding Space To Bind Them All. CVPR 2023.

  7. Zhu, L., et al. (2023). LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. ICLR 2024.

  8. Amazon Web Services. (2024). Amazon Nova Multimodal Embeddings: State-of-the-art embedding model for agentic RAG and semantic search. AWS News Blog.

  9. Deshmukh, S., et al. (2024). From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence. Google Research Blog.

  10. Guzhov, A., et al. (2021). AudioCLIP: Extending CLIP to Image, Text and Audio. arXiv preprint arXiv:2106.13043.

  11. TIGER Lab. (2024). MMEB-V2: Massive Multimodal Embedding Benchmark. GitHub Repository.

  12. Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV 2023.