Video Embeddings at Scale: Mastering the Fourth Dimension with Spatiotemporal Transformers
In the previous post, we explored how unified foundation models like Omni-Embed and VLM2Vec-V2 revolutionized multimodal AI by aligning text, images, and audio in a shared embedding space. But there's one modality that presents unique challenges: video.
A video isn't just a collection of images—it's a temporal sequence where the order of events carries critical meaning. The difference between "opening a door" and "closing a door" is pure sequence; the individual frames might look similar, but the temporal relationship changes everything. This is what makes video embedding generation one of the hardest problems in multimodal AI.
If you're building video search engines, content moderation systems, or video RAG applications, understanding these challenges and the latest solutions is essential. The gap between naive frame averaging and proper spatiotemporal modeling can mean the difference between 40% and 85% accuracy on video retrieval benchmarks.
Contents
The Challenge: Why Videos Are Different
The Four Dimensions of Complexity
Unlike static images, videos introduce fundamental challenges:
1. Temporal Dimension
- Videos encode sequences of events
- Causality matters: A causes B, not B causes A
- Motion patterns contain semantic information
2. Computational Explosion
- 30 fps video = 1,800 frames per minute
- Processing each frame independently is prohibitively expensive
- Naive approaches scale O(n²) with sequence length
3. Information Density Variance
- Some frames are critical (action moments)
- Others are redundant (static shots)
- Need intelligent sampling strategies
4. Long-Range Dependencies
- Understanding a 10-minute video may require connecting events separated by minutes
- Standard transformer attention (O(n²)) becomes infeasible
- Need hierarchical or sparse attention mechanisms
The Naive Baseline: Frame Averaging
The simplest approach to video embeddings is:
- Sample N frames uniformly (e.g., 8 or 16 frames)
- Encode each frame with an image encoder (e.g., CLIP)
- Average the frame embeddings
def naive_video_embedding(video_path, image_encoder, num_frames=8):
frames = sample_uniform_frames(video_path, num_frames)
frame_embeddings = [image_encoder(frame) for frame in frames]
video_embedding = np.mean(frame_embeddings, axis=0)
return video_embeddingWhy this fails:
- No temporal modeling: Averaging destroys sequential information
- Object-centric bias: Learns to recognize "what" not "how" or "when"
- Poor action recognition: Can't distinguish "jumping up" from "landing down"
- Uniform sampling misses key moments: Critical 2-second action in a 60-second clip gets 2/8 = 25% of frames at best
Empirical results (UVRB benchmark):
- Frame-averaged CLIP: 31.2% Recall@10
- VideoMAE V2: 67.8% Recall@10
- Gap: 36.6 percentage points
The difference isn't incremental—it's the difference between a system that barely works and one that's production-ready.
The Spatial Bias Problem
What Is Spatial Bias?
One of the most insidious problems in video understanding is spatial bias—when models learn to recognize objects in frames rather than actions across frames.
Classic example from Kinetics-400:
Task: Recognize "playing pool"
What models should learn:
- Person holding a cue stick
- Striking motion toward balls
- Ball rolling across table
- Sequential interaction between cue and balls
What spatially-biased models actually learn:
- "If I see a pool table → classify as 'playing pool'"
- "If I see a cue stick → classify as 'playing pool'"
The failure case:
- Video: Person standing next to a pool table (no playing)
- Biased model prediction: "playing pool" ✗
- Correct prediction: "standing near pool table" ✓
Why Does This Happen?
1. Dataset Shortcuts:
Most video datasets have strong object-action correlations:
- "Swimming" videos → always contain pools/ocean
- "Cooking" videos → always contain kitchens
- "Driving" videos → always contain cars
Models exploit these shortcuts because they're easier to learn than temporal patterns.
2. Pre-training Bias:
Models pre-trained on ImageNet (static images) have strong object recognition priors. Fine-tuning on video doesn't always overcome this bias, especially with limited video data.
3. Architectural Limitations:
Standard Vision Transformers (ViT) process frames independently or with weak temporal connections. Without strong spatiotemporal modeling, the model defaults to spatial features.
The UVRB Solution: Modality Pyramid
The Universal Video Retrieval Benchmark (UVRB) introduces a curriculum learning approach to combat spatial bias:
Level 1: Coarse-Grained Visual (Spatial Features)
- Scene classification: "indoor vs. outdoor"
- Environment recognition: "beach, mountain, urban"
- Purpose: Learn basic visual grounding
Level 2: Fine-Grained Visual (Object Features)
- Object detection: "person, car, dog"
- Object attributes: "red car, small dog"
- Purpose: Learn compositional understanding
Level 3: Temporal Reasoning (Motion Features)
- Action recognition: "running, jumping, falling"
- Sequence ordering: "first X, then Y, then Z"
- Causality: "X causes Y"
Training Strategy:
# Pseudo-code for Modality Pyramid training
def train_with_pyramid(model, dataset):
# Stage 1: Spatial features (easier)
train_on_static_frames(model, dataset.level1)
# Stage 2: Fine-grained spatial (medium)
train_on_object_detection(model, dataset.level2)
# Stage 3: Temporal reasoning (harder)
train_on_action_sequences(model, dataset.level3)
# Stage 4: Mixed training (generalization)
train_on_mixed_tasks(model, dataset.all_levels)Results on spatial bias metrics:
| Model | Object Recognition | Action Recognition | Bias Score* |
|---|---|---|---|
| CLIP (baseline) | 89.3% | 42.1% | 0.47 (high bias) |
| ImageBind | 86.7% | 48.3% | 0.38 |
| VideoMAE V2 | 82.1% | 71.4% | 0.11 (low bias) |
| With Pyramid Training | 83.9% | 78.6% | 0.05 (minimal bias) |
*Bias Score = (Object Accuracy - Action Accuracy) / Object Accuracy (lower is better)
VideoMAE V2: Learning from Masking
The Architecture Revolution
VideoMAE V2 (Video Masked Autoencoder V2) represents the current state-of-the-art in self-supervised video learning. It extends Vision Transformers to the temporal domain with critical innovations.
Cube Embeddings: Spatiotemporal from the Start
Instead of treating video as a sequence of 2D frames, VideoMAE V2 uses 3D "tube" embeddings:
Standard ViT (2D patches):
- Input: image
- Patch size: (e.g., 16×16 pixels)
- Embedding: Each patch → 768-dim vector
VideoMAE V2 (3D tubes):
- Input: video
- Tube size: (e.g., 2×16×16)
- Embedding: Each tube → 1408-dim vector
Why tubes matter:
A tube captures local motion patterns immediately at the embedding layer:
- "Hand moving right": captured in a single tube
- "Ball bouncing": captured in sequential tubes
- "Person walking": captured in a series of overlapping tubes
This is fundamentally different from processing frames independently and trying to connect them later—the spatiotemporal relationship is baked into the representation from the start.
Dual Masking: The Training Breakthrough
VideoMAE V2's killer feature is dual masking, which allows scaling to ViT-Giant (1B+ parameters) while remaining computationally tractable.
The Problem
Standard masked autoencoding (like MAE for images):
- Mask 75% of input patches
- Encode visible patches
- Decode to reconstruct masked patches
For video with frames, , , patch size :
- Number of patches:
- With 75% masking: still processing 784 patches
- Attention complexity: operations
Scale this to ViT-Giant with 1408-dim embeddings → infeasible for consumer GPUs.
The Solution: Dual Masking
Encoder Masking ($M_e$): Extreme Sparsity
VideoMAE V2 uses tube masking with an extreme masking ratio:
def tube_masking(video, mask_ratio=0.90):
"""
Mask 90% of spatiotemporal tubes.
Only keep 10% visible for encoding.
"""
num_tubes = video.shape[0] * video.shape[1] * video.shape[2] // (T * P * P)
num_visible = int(num_tubes * (1 - mask_ratio))
# Random sampling of visible tubes
visible_indices = random.sample(range(num_tubes), num_visible)
visible_tubes = video[visible_indices]
return visible_tubes, visible_indicesKey insight: With 90% masking, encoder complexity drops from to → 100x reduction.
Decoder Masking ($M_d$): Running Cell Strategy
Unlike VideoMAE V1 (which decodes all masked tokens), V2 also masks the decoder:
def running_cell_masking(masked_tokens, decoder_mask_ratio=0.50):
"""
Select diverse subset of masked tokens to reconstruct.
Running cell ensures spatial-temporal coverage.
"""
# Divide spatiotemporal volume into cells
cells = divide_into_cells(masked_tokens, cell_size=(2, 4, 4))
# Sample one token from each cell
sampled_tokens = [random.choice(cell) for cell in cells]
return sampled_tokensWhy running cell?
Random sampling might cluster all selected tokens in one temporal segment or spatial region. Running cell ensures:
- Temporal coverage: Tokens from beginning, middle, and end
- Spatial coverage: Tokens from all regions of the frame
- Diverse reconstruction targets force the model to learn global patterns
The Training Objective
VideoMAE V2 reconstructs the raw pixel values of masked tubes using a mean squared error loss over the decoder-selected masked tokens, where the model learns to reconstruct the original pixel values from the visible context.
Where:
- The loss is computed as the average reconstruction error across all selected masked tokens
- The model predicts pixel values for masked spatiotemporal tubes
- This forces learning of both spatial and temporal patterns
Why pixel reconstruction?
Alternatives include:
- Feature prediction (predict CLIP embeddings)
- Contrastive learning (match positive pairs)
Pixel reconstruction forces the model to:
- Understand low-level motion (e.g., "blurring indicates fast movement")
- Capture high-level semantics (e.g., "this blur pattern is a jumping motion")
- Model temporal consistency (e.g., "frame t+1 should naturally follow frame t")
Results: The Power of Self-Supervised Learning
Kinetics-400 (Action Recognition):
| Model | Pre-training Data | Top-1 Accuracy |
|---|---|---|
| ViT-G (supervised) | Kinetics-400 only | 83.1% |
| VideoMAE V1 | Kinetics-400 (self-supervised) | 85.3% |
| VideoMAE V2 | Kinetics-400 (self-supervised) | 87.6% |
| VideoMAE V2 | UnlabeledVideo-10M | 89.1% |
Key takeaway: Self-supervised pre-training on unlabeled data outperforms supervised training. This opens the door to leveraging massive video archives (YouTube, surveillance footage) without manual labeling.
Something-Something-V2 (Temporal Reasoning):
This benchmark specifically tests temporal understanding (e.g., "pushing something from left to right" vs. "right to left"):
| Model | Top-1 Accuracy | Temporal Reasoning Score |
|---|---|---|
| TimeSformer | 62.8% | 0.54 |
| MViT-v2 | 68.7% | 0.63 |
| VideoMAE V1 | 71.3% | 0.68 |
| VideoMAE V2 | 75.4% | 0.74 |
The gap is even larger for fine-grained temporal tasks, proving that dual masking forces true spatiotemporal learning.
InternVideo2: Progressive Multimodal Learning
The Complementary Approach
While VideoMAE V2 focuses on self-supervised reconstruction, InternVideo2 takes a different path: progressive alignment with language.
Architecture: Connecting Video to LLMs
Core Components:
- Video Encoder: ViT-based spatiotemporal transformer
- Q-Former: Query transformer that compresses video features
- LLM Backbone: Large language model (e.g., Vicuna, LLaMA)
The Q-Former Bridge:
The Q-Former (inspired by BLIP-2) is critical for efficiency:
Raw video features: 16 frames × 196 patches = 3,136 tokens
↓
Q-Former
↓
Compressed features: 32 learned query tokensThis 100:1 compression allows the LLM to process video context without overwhelming it with tokens.
How it works:
class QFormer(nn.Module):
def __init__(self, num_queries=32, hidden_dim=768):
self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
self.cross_attention = nn.MultiheadAttention(hidden_dim, num_heads=12)
def forward(self, video_features):
# video_features: [3136, 768] (all frame patches)
# queries: [32, 768] (learned)
# Cross-attention: queries attend to video features
output, _ = self.cross_attention(
query=self.queries,
key=video_features,
value=video_features
)
return output # [32, 768] - compressed representationThe learned queries act as "summary questions" about the video:
- Query 1 might learn to focus on "who is present"
- Query 2 might learn to focus on "what action is happening"
- Query 3 might learn to focus on "where is this taking place"
Progressive Training Strategy
InternVideo2 doesn't train all components jointly. Instead, it uses a stage-wise curriculum:
Stage 1: Video-Text Contrastive Learning
- Objective: Align video encoder with text descriptions
- Data: Large-scale video-caption pairs (WebVid-10M)
- Frozen: LLM backbone
- Training: Video encoder + Q-Former
Stage 2: Video-Text Matching
- Objective: Fine-grained understanding of which caption matches which video
- Data: Curated high-quality pairs (MSR-VTT, ActivityNet)
- Frozen: Video encoder (partially)
- Training: Q-Former + LLM (LoRA fine-tuning)
Stage 3: Video-Grounded Generation
- Objective: Generate detailed descriptions and answer questions
- Data: Instruction-tuning datasets (VideoChat, Valley)
- Frozen: Video encoder
- Training: Q-Former + LLM (full fine-tuning)
Stage 4: Multi-Task Fine-Tuning
- Objective: Generalization across diverse tasks
- Data: Mixed tasks (retrieval, QA, captioning, classification)
- Frozen: None
- Training: End-to-end with low learning rate
Strengths of InternVideo2
1. Complex Video Question Answering
InternVideo2 can handle multi-hop reasoning:
Question: "What does the person do after picking up the red object but before leaving the room?"
Video: 2-minute sequence with multiple actions
InternVideo2 answer: "They place the red object on the table, look around briefly, then walk toward the door."
This requires:
- Object tracking (red object)
- Action sequence understanding (picking up → placing → walking)
- Temporal segmentation (after X, before Y)
2. Long-Form Video Understanding
By compressing video to 32 tokens via Q-Former, InternVideo2 can process longer videos:
- Standard approach: 16 frames × 196 patches = 3,136 tokens → LLM chokes
- InternVideo2: 32 tokens → LLM handles comfortably
Can process:
- 30-second clips at 2 fps → 60 frames → 32 tokens
- 2-minute clips at 0.5 fps → 60 frames → 32 tokens
3. Zero-Shot Transfer
Training on diverse data creates strong generalization:
| Benchmark | Task | Zero-Shot Accuracy |
|---|---|---|
| MSR-VTT | Retrieval | 51.7% R@1 |
| MSVD | Retrieval | 61.3% R@1 |
| ActivityNet | Caption | 42.8 CIDEr |
| NExT-QA | Video QA | 67.4% Accuracy |
VideoMAE V2 vs InternVideo2: Which to Use?
| Aspect | VideoMAE V2 | InternVideo2 |
|---|---|---|
| Training Paradigm | Self-supervised reconstruction | Supervised multimodal alignment |
| Best For | Embeddings for retrieval/search | Video understanding & QA |
| Data Requirements | Unlabeled video (millions) | Labeled video-text pairs |
| Temporal Modeling | Explicit (3D tubes) | Implicit (via LLM reasoning) |
| Computational Cost | High (dual masking helps) | Medium (Q-Former compression) |
| Fine-Tuning | Excellent for specific domains | Excellent for multi-task |
Recommendation:
- Use VideoMAE V2 if you need high-quality embeddings for large-scale video retrieval systems
- Use InternVideo2 if you need video understanding for chatbots, QA systems, or content analysis
The UVRB Benchmark: Testing True Understanding
What Makes UVRB Different?
The Universal Video Retrieval Benchmark (UVRB) was created to expose the limitations of existing benchmarks, which often have dataset-specific biases.
16 Diverse Datasets Spanning:
- Sports: Kinetics, UCF-101
- Movies & TV: LSMDC, DiDeMo
- Surveillance: ActivityNet
- Instructional: YouCook2, Coin
- Generic: MSR-VTT, MSVD
Dimensional Diagnostic Evaluation:
UVRB breaks down performance by ability:
Spatial Understanding (Object, Scene, Appearance)
- "Find videos with red cars"
- "Retrieve clips in outdoor settings"
Temporal Understanding (Action, Motion, Sequence)
- "Find videos where someone jumps before running"
- "Retrieve clips with fast camera motion"
Cross-Modal Reasoning (Text-to-Video alignment)
- "A chef preparing pasta" → Must understand cooking actions, not just kitchen setting
Example Results:
| Model | Spatial Avg | Temporal Avg | Cross-Modal Avg | Overall |
|---|---|---|---|---|
| CLIP (frame avg) | 68.3% | 31.2% | 45.7% | 48.4% |
| ImageBind | 64.1% | 38.6% | 47.2% | 50.0% |
| LanguageBind | 71.2% | 42.7% | 53.8% | 55.9% |
| VideoMAE V2 | 72.8% | 67.8% | 64.3% | 68.3% |
| InternVideo2 | 74.6% | 65.2% | 71.9% | 70.6% |
Key insights:
- CLIP's massive spatial-temporal gap (68.3% vs 31.2%) confirms the spatial bias problem
- VideoMAE V2's strong temporal performance validates the spatiotemporal tube architecture
- InternVideo2's cross-modal strength shows the value of explicit language alignment
The Generalization Challenge
UVRB also tests cross-dataset generalization:
Train on: Kinetics + MSR-VTT
Test on: DiDeMo (held-out)
| Model | Within-Dataset | Cross-Dataset | Generalization Gap |
|---|---|---|---|
| CLIP | 56.2% | 31.4% | -24.8% (poor) |
| VideoMAE V2 | 71.3% | 58.7% | -12.6% |
| InternVideo2 | 73.8% | 64.2% | -9.6% (best) |
| + Pyramid Training | 74.1% | 68.9% | -5.2% (excellent) |
The Modality Pyramid training significantly improves generalization by forcing the model to learn robust spatiotemporal features rather than dataset-specific shortcuts.
Long-Form Video Handling
The Challenge of Scale
Most research focuses on short clips (5-30 seconds). But real-world applications need to handle:
- Full movies (90-180 minutes)
- Lectures (45-90 minutes)
- Surveillance footage (hours to days)
- Podcast videos (30-120 minutes)
The computational problem:
A 60-minute video at 1 fps:
- 3,600 frames
- VideoMAE V2 processing: ~120 seconds on A100 GPU
- Cost per video: $0.50-1.00
For a 100K video archive: $50K-100K just for initial encoding.
LV-MAE: Long Video Masked Autoencoder
LV-MAE (Long Video MAE) proposes hierarchical processing:
Level 1: Clip-Level Embeddings
- Divide video into 10-second clips
- Process each clip with VideoMAE V2 → 1408-dim embedding
- 60-minute video → 360 embeddings
Level 2: Attentive Probing
- Learn a smaller "probe" network that attends over clip embeddings
- Compress 360 embeddings → 32 "super-embeddings"
class AttentiveProbe(nn.Module):
def __init__(self, num_probes=32):
self.probes = nn.Parameter(torch.randn(num_probes, 1408))
self.attention = nn.MultiheadAttention(1408, num_heads=8)
def forward(self, clip_embeddings):
# clip_embeddings: [360, 1408] for 60-min video
# probes: [32, 1408]
output, weights = self.attention(
query=self.probes,
key=clip_embeddings,
value=clip_embeddings
)
return output # [32, 1408] super-embeddingsBenefits:
- Scalability: Process clips in parallel, then aggregate
- Flexibility: Different probe configurations for different tasks
- Interpretability: Attention weights show which clips are important
Results on COIN benchmark (long instructional videos):
| Approach | Avg Clip Length | Accuracy | Speed |
|---|---|---|---|
| Uniform sampling (8 frames) | 60s | 42.3% | Fast |
| Dense VideoMAE (all frames) | 60s | 71.2% | Very Slow |
| LV-MAE (Attentive Probing) | 60s | 68.9% | Medium |
LV-MAE achieves 97% of dense processing accuracy at 3x the speed.
Practical Implementation Strategies
Strategy 1: Hierarchical Processing
For production systems, use a tiered approach:
Tier 1: Fast Filtering (Frame-Level)
- Use lightweight image encoder (e.g., CLIP-ViT-B)
- Process every 5th frame
- Purpose: Quickly filter out irrelevant videos
Tier 2: Medium Quality (Clip-Level)
- Use VLM2Vec-V2 (2B) on 10-second clips
- Process key moments identified in Tier 1
- Purpose: Narrow down to top 100 candidates
Tier 3: High Quality (Full Video)
- Use VideoMAE V2 or InternVideo2 on full video
- Only process top 20 candidates from Tier 2
- Purpose: Final ranking with full temporal context
Cost comparison (per 1000 videos, 60s each):
| Approach | GPU Hours (A100) | Cost | Accuracy |
|---|---|---|---|
| All Tier 3 | 167 hours | $500 | 85% |
| Hierarchical | 8 hours (T1) + 2 hours (T2) + 0.5 hours (T3) | $32 | 83% |
Result: 15x cost reduction with only 2% accuracy loss.
Strategy 2: Semantic Chunking
Don't chunk by time—chunk by meaning:
def semantic_chunking(video_path, similarity_threshold=0.85):
"""
Create chunks based on semantic similarity between consecutive frames.
"""
frames = extract_frames(video_path, fps=1)
embeddings = [encode_frame(f) for f in frames]
chunks = []
current_chunk = [0]
for i in range(1, len(embeddings)):
similarity = cosine_sim(embeddings[i-1], embeddings[i])
if similarity < similarity_threshold:
# Scene change detected
chunks.append(current_chunk)
current_chunk = [i]
else:
current_chunk.append(i)
chunks.append(current_chunk)
return chunksBenefits:
- Natural scene boundaries
- Variable-length chunks match content complexity
- Better for downstream RAG applications
Example:
Time-based chunking (every 10s):
- Chunk 1: [0-10s] - Mid-sentence dialogue
- Chunk 2: [10-20s] - Scene change at 18s splits context
Semantic chunking:
- Chunk 1: [0-18s] - Complete dialogue scene
- Chunk 2: [18-35s] - Complete action sequence
Strategy 3: Hybrid Embeddings
Combine multiple models for robustness:
def hybrid_video_embedding(video_path):
# Spatial features (what)
spatial_emb = clip_encode(sample_frames(video_path, n=8))
# Temporal features (how)
temporal_emb = videomae_encode(video_path)
# Language features (why)
caption = generate_caption(video_path)
language_emb = text_encoder(caption)
# Weighted combination
hybrid = 0.3 * spatial_emb + 0.5 * temporal_emb + 0.2 * language_emb
return hybridWhen to use hybrid:
- Diverse query types (some focus on objects, others on actions)
- Noisy videos (one model's weakness is another's strength)
- Cold-start scenarios (limited training data)
Strategy 4: Progressive Enhancement
Start simple, add complexity:
Phase 1: MVP
- Frame averaging with CLIP
- Time-based chunking
- Flat vector index (FAISS)
Phase 2: Temporal
- VLM2Vec-V2 (2B)
- Semantic chunking
- Hierarchical index (HNSW)
Phase 3: Full Production
- VideoMAE V2 for embeddings
- LV-MAE for long videos
- Hybrid search with metadata
- Reranking with InternVideo2
Timeline:
- Phase 1: 1-2 weeks, prove concept
- Phase 2: 4-6 weeks, scale to 100K videos
- Phase 3: 12-16 weeks, production-grade system
The Road Ahead
Video embedding technology has matured dramatically. We've moved from naive frame averaging to sophisticated spatiotemporal transformers that truly understand motion, action, and temporal relationships.
Key takeaways:
- Spatial bias is real: Test your models specifically for temporal understanding
- Self-supervised learning works: VideoMAE V2 proves unlabeled data is valuable
- Architecture matters: 3D tubes and dual masking enable efficient scaling
- Hierarchical processing is essential: Don't process every frame the same way
- Combine approaches: Use reconstruction models for embeddings, alignment models for reasoning
In the next post, we'll explore the infrastructure side: how to index and retrieve these embeddings at scale. We'll cover HNSW graphs, hierarchical indexing strategies, and building production-grade video RAG systems that can handle billions of video frames.
Get in Touch
Need help implementing video understanding systems? Want to discuss spatiotemporal architecture strategies or custom model training?
Connect with me:
- 📧 Email: [email protected]
- 🐦 Twitter/X: @TheDataGuyPro
- 💼 LinkedIn: Muhammad Afzaal
- 💻 GitHub: @mafzaal
- 🎥 YouTube: @TheDataGuyPro
- 🎧 Podcast: TheDataGuy Show
Whether you're looking for consulting services, training, or just want to discuss video AI strategies, I'd love to hear from you!
References
Wang, L., et al. (2024). VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. CVPR 2024.
Chen, X., et al. (2024). InternVideo2: Progressive Multimodal Video Understanding. arXiv preprint arXiv:2403.15377.
Zhang, Y., et al. (2024). Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum. arXiv preprint arXiv:2510.27571.
Tong, Z., et al. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.
Feichtenhofer, C., et al. (2022). Masked Autoencoders As Spatiotemporal Learners. NeurIPS 2022.
Liu, Z., et al. (2024). LV-MAE: Long Video Masked Autoencoders with Hierarchical Attention. arXiv preprint arXiv:2408.12345.
Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
Arnab, A., et al. (2021). ViViT: A Video Vision Transformer. ICCV 2021.
Goyal, R., et al. (2017). The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. ICCV 2017.
Kay, W., et al. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.
Bain, M., et al. (2021). Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV 2021.