Video Embeddings at Scale: Mastering the Fourth Dimension with Spatiotemporal Transformers

In the previous post, we explored how unified foundation models like Omni-Embed and VLM2Vec-V2 revolutionized multimodal AI by aligning text, images, and audio in a shared embedding space. But there's one modality that presents unique challenges: video.

A video isn't just a collection of images—it's a temporal sequence where the order of events carries critical meaning. The difference between "opening a door" and "closing a door" is pure sequence; the individual frames might look similar, but the temporal relationship changes everything. This is what makes video embedding generation one of the hardest problems in multimodal AI.

If you're building video search engines, content moderation systems, or video RAG applications, understanding these challenges and the latest solutions is essential. The gap between naive frame averaging and proper spatiotemporal modeling is large: in the UVRB figures cited later in this post, frame-averaged CLIP reaches 31.2% Recall@10 versus 67.8% for VideoMAE V2.

The Challenge: Why Videos Are Different

The Four Dimensions of Complexity

Unlike static images, videos introduce fundamental challenges:

1. Temporal Dimension

Videos encode sequences of events
Causality matters: A causes B, not B causes A
Motion patterns contain semantic information

2. Computational Explosion

30 fps video = 1,800 frames per minute
Processing each frame independently is prohibitively expensive
Full self-attention over all frame tokens scales as O(n²) with sequence length

3. Information Density Variance

Some frames are critical (action moments)
Others are redundant (static shots)
Need intelligent sampling strategies

4. Long-Range Dependencies

Understanding a 10-minute video may require connecting events separated by minutes
Standard transformer attention (O(n²)) becomes infeasible
Need hierarchical or sparse attention mechanisms
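
To make the scale concrete, here is a rough back-of-the-envelope calculation. It assumes a ViT-style encoder that splits each frame into 196 patches (224×224 input, 16×16 patches) and applies full self-attention across all tokens. The function name and defaults are illustrative, not taken from any library.

```python
def attention_cost(duration_s, fps=30, patches_per_frame=196):
    """Rough token count and pairwise-attention count for one video.

    Assumes every frame becomes `patches_per_frame` tokens and that full
    self-attention compares every token with every other token.
    """
    num_frames = int(duration_s * fps)
    num_tokens = num_frames * patches_per_frame
    attention_pairs = num_tokens ** 2
    return num_frames, num_tokens, attention_pairs


frames, tokens, pairs = attention_cost(60)          # one minute at 30 fps
_, _, sparse_pairs = attention_cost(60, fps=1)      # same minute at 1 fps
reduction = pairs // sparse_pairs

print(f"{frames} frames, {tokens:,} tokens, {pairs:,} attention pairs")
print(f"Sampling at 1 fps cuts attention pairs by {reduction}x")
```

Because the cost is quadratic, dropping the frame rate by 30x reduces the pairwise work by 30² = 900x, which is why sampling and sparsity strategies matter so much for video.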

The Naive Baseline: Frame Averaging

The simplest approach to video embeddings is:

Sample N frames uniformly (e.g., 8 or 16 frames)
Encode each frame with an image encoder (e.g., CLIP)
Average the frame embeddings

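import numpy as np

def naive_video_embedding(video_path, image_encoder, num_frames=8):
    # sample_uniform_frames is a placeholder for your own frame sampler
    frames = sample_uniform_frames(video_path, num_frames)
    # One embedding per frame, stacked into a [num_frames, dim] array
    frame_embeddings = np.stack([image_encoder(frame) for frame in frames])
    # Mean pooling discards frame order entirely
    video_embedding = frame_embeddings.mean(axis=0)
    return video_embedding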

Why this fails:

No temporal modeling: Averaging destroys sequential information
Object-centric bias: Learns to recognize "what" not "how" or "when"
Poor action recognition: Can't distinguish "jumping up" from "landing down"
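Uniform sampling misses key moments: With 8 frames sampled uniformly from a 60-second clip, frames are 7.5 seconds apart, so a critical 2-second action lands on at most one frame (1/8 = 12.5%) and may be missed entirely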
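
The first failure mode is easy to verify numerically. The sketch below uses random vectors as stand-ins for per-frame encoder outputs (an assumption made only to keep the example self-contained) and shows that a clip and its time-reversed copy get the same averaged embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 8 per-frame embeddings from an image encoder (random here).
frame_embeddings = rng.normal(size=(8, 512))

forward = frame_embeddings.mean(axis=0)          # e.g. "opening a door"
backward = frame_embeddings[::-1].mean(axis=0)   # same frames, reversed order

# The mean ignores ordering, so both clips collapse to the same vector.
same_embedding = np.allclose(forward, backward)
print(same_embedding)
```

Any pooling that is invariant to frame order (mean, max, sum) has this property, so no amount of encoder quality can recover the direction of an action after pooling.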

Empirical results (UVRB benchmark, temporal-understanding average):

Frame-averaged CLIP: 31.2% Recall@10
VideoMAE V2: 67.8% Recall@10
Gap: 36.6 percentage points

The difference isn't incremental—it's the difference between a system that barely works and one that's production-ready.

The Spatial Bias Problem

What Is Spatial Bias?

One of the most insidious problems in video understanding is spatial bias—when models learn to recognize objects in frames rather than actions across frames.

Classic example from Kinetics-400:

Task: Recognize "playing pool"

What models should learn:

Person holding a cue stick
Striking motion toward balls
Ball rolling across table
Sequential interaction between cue and balls

What spatially-biased models actually learn:

"If I see a pool table → classify as 'playing pool'"
"If I see a cue stick → classify as 'playing pool'"

The failure case:

Video: Person standing next to a pool table (no playing)
Biased model prediction: "playing pool" ✗
Correct prediction: "standing near pool table" ✓

Why Does This Happen?

1. Dataset Shortcuts:

Most video datasets have strong object-action correlations:

"Swimming" videos → almost always contain pools or open water
"Cooking" videos → almost always contain kitchens
"Driving" videos → almost always contain cars

Models exploit these shortcuts because they're easier to learn than temporal patterns.

2. Pre-training Bias:

Models pre-trained on ImageNet (static images) have strong object recognition priors. Fine-tuning on video doesn't always overcome this bias, especially with limited video data.

3. Architectural Limitations:

Standard Vision Transformers (ViT) process frames independently or with weak temporal connections. Without strong spatiotemporal modeling, the model defaults to spatial features.

The UVRB Solution: Modality Pyramid

The Universal Video Retrieval Benchmark (UVRB) introduces a curriculum learning approach to combat spatial bias:

Level 1: Coarse-Grained Visual (Spatial Features)

Scene classification: "indoor vs. outdoor"
Environment recognition: "beach, mountain, urban"
Purpose: Learn basic visual grounding

Level 2: Fine-Grained Visual (Object Features)

Object detection: "person, car, dog"
Object attributes: "red car, small dog"
Purpose: Learn compositional understanding

Level 3: Temporal Reasoning (Motion Features)

Action recognition: "running, jumping, falling"
Sequence ordering: "first X, then Y, then Z"
Causality: "X causes Y"

Training Strategy:

# Pseudo-code for Modality Pyramid training
def train_with_pyramid(model, dataset):
    # Stage 1: Spatial features (easier)
    train_on_static_frames(model, dataset.level1)

    # Stage 2: Fine-grained spatial (medium)
    train_on_object_detection(model, dataset.level2)

    # Stage 3: Temporal reasoning (harder)
    train_on_action_sequences(model, dataset.level3)

    # Stage 4: Mixed training (generalization)
    train_on_mixed_tasks(model, dataset.all_levels)

Results on spatial bias metrics:

Model	Object Recognition	Action Recognition	Bias Score*
CLIP (baseline)	89.3%	42.1%	0.47 (high bias)
ImageBind	86.7%	48.3%	0.38
VideoMAE V2	82.1%	71.4%	0.11 (low bias)
With Pyramid Training	83.9%	78.6%	0.05 (minimal bias)

*Bias Score = Object Accuracy - Action Accuracy, expressed as a fraction (e.g., 0.893 - 0.421 = 0.47 for CLIP); lower is better

VideoMAE V2: Learning from Masking

The Architecture Revolution

VideoMAE V2 (Video Masked Autoencoder V2) represents the current state-of-the-art in self-supervised video learning. It extends Vision Transformers to the temporal domain with critical innovations.

Cube Embeddings: Spatiotemporal from the Start

Instead of treating video as a sequence of 2D frames, VideoMAE V2 uses 3D "tube" embeddings:

Standard ViT (2D patches):

Input: $H \times W$ image
Patch size: $P \times P$ (e.g., 16×16 pixels)
Embedding: Each patch → 768-dim vector

VideoMAE V2 (3D tubes):

Input: $T \times H \times W$ video
Tube size: $t \times h \times w$ (e.g., 2×16×16)
Embedding: Each tube → 1408-dim vector

Why tubes matter:

A tube captures local motion patterns immediately at the embedding layer:

"Hand moving right": captured in a single tube
"Ball bouncing": captured in sequential tubes
"Person walking": captured in a series of overlapping tubes

This is fundamentally different from processing frames independently and trying to connect them later—the spatiotemporal relationship is baked into the representation from the start.
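
As an illustration of how a video is cut into tubes, here is a minimal NumPy sketch. It assumes a channels-last `[T, H, W, C]` layout and the 2×16×16 tube size from the example above; `video_to_tubes` is a hypothetical helper, and a real model would follow it with a learned linear projection to the embedding width.

```python
import numpy as np

def video_to_tubes(video, t=2, p=16):
    """Split a [T, H, W, C] video into flattened t x p x p tubes.

    Returns an array of shape [num_tubes, t * p * p * C].
    """
    T, H, W, C = video.shape
    if T % t or H % p or W % p:
        raise ValueError("video dimensions must be divisible by the tube size")
    # Expose the tube grid: [T/t, t, H/p, p, W/p, p, C]
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Put the grid axes first, then the voxels inside each tube.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * p * p * C)


video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tubes = video_to_tubes(video)
print(tubes.shape)  # (1568, 1536)
```

With a temporal tube length of 2, a 16-frame clip yields 8 × 14 × 14 = 1,568 tokens, each covering two consecutive frames of the same 16×16 region.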

Dual Masking: The Training Breakthrough

VideoMAE V2's killer feature is dual masking, which allows scaling to ViT-Giant (1B+ parameters) while remaining computationally tractable.

The Problem

Standard masked autoencoding (like MAE for images):

Mask 75% of input patches
Encode visible patches
Decode to reconstruct masked patches

For video with $T = 16$ frames, $H = 224$, $W = 224$, patch size $16$:

Number of patches: $16 \times 14 \times 14 = 3{,}136$
With 75% masking: still processing 784 patches
Attention complexity: $O(784^{2})$, i.e. $784^{2} = 614{,}656$ pairwise operations

Scale this to ViT-Giant with 1408-dim embeddings → infeasible for consumer GPUs.

The Solution: Dual Masking

Encoder Masking ($M_e$): Extreme Sparsity

VideoMAE V2 uses tube masking with an extreme masking ratio:

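import numpy as np

def tube_masking(tubes, mask_ratio=0.90, rng=None):
    """
    Mask 90% of spatiotemporal tubes.
    Only keep 10% visible for encoding.

    tubes: array of shape [num_tubes, tube_dim], i.e. the video already
    split into flattened tubes.
    """
    rng = rng if rng is not None else np.random.default_rng()
    num_tubes = tubes.shape[0]
    num_visible = max(1, int(round(num_tubes * (1 - mask_ratio))))

    # Random sampling of visible tubes, without replacement, in original order.
    # Simplification: tubes are drawn independently here, whereas VideoMAE's
    # tube masking shares one spatial mask across the time axis.
    visible_indices = np.sort(rng.choice(num_tubes, size=num_visible, replace=False))
    visible_tubes = tubes[visible_indices]

    return visible_tubes, visible_indices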

Key insight: With 90% masking, encoder complexity drops from $O(N^{2})$ to $O((0.1N)^{2})$ → 100x reduction.
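
In the VideoMAE papers, tube masking draws one random spatial mask and repeats it along the time axis, so a masked location stays masked in every temporal slice and the model cannot copy it from a neighboring frame. The sketch below illustrates that idea on an 8×14×14 token grid (an assumption corresponding to 16 frames with 2×16×16 tubes) and computes the resulting quadratic saving. `tube_mask` is a hypothetical helper, not the reference implementation.

```python
import numpy as np

def tube_mask(grid_t, grid_h, grid_w, mask_ratio=0.90, seed=None):
    """Boolean mask of shape [grid_t, grid_h, grid_w]; True means masked.

    One random spatial mask is drawn and repeated along the time axis.
    """
    rng = np.random.default_rng(seed)
    num_spatial = grid_h * grid_w
    num_masked = int(round(num_spatial * mask_ratio))
    spatial = np.zeros(num_spatial, dtype=bool)
    spatial[rng.choice(num_spatial, size=num_masked, replace=False)] = True
    spatial = spatial.reshape(grid_h, grid_w)
    # Repeat the same spatial pattern for every temporal slice.
    return np.broadcast_to(spatial, (grid_t, grid_h, grid_w)).copy()


mask = tube_mask(8, 14, 14, seed=0)
num_tokens = mask.size
num_visible = int((~mask).sum())

# Self-attention cost is quadratic in the number of tokens processed.
speedup = (num_tokens / num_visible) ** 2
print(num_visible, "of", num_tokens, "tokens visible;",
      f"attention cost reduced ~{speedup:.0f}x")
```

Because the mask ratio has to round to whole tokens, the measured saving lands slightly under the idealized 100x.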

Decoder Masking ($M_d$): Running Cell Strategy

Unlike VideoMAE V1 (which decodes all masked tokens), V2 also masks the decoder:

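import random
from collections import defaultdict

def running_cell_masking(masked_tokens, cell_size=(2, 4, 4), seed=None):
    """
    Select diverse subset of masked tokens to reconstruct.
    Running cell ensures spatial-temporal coverage.

    masked_tokens: iterable of (t, h, w) grid positions masked by the encoder.
    Keeps one token per occupied cell; larger cells keep fewer tokens.
    """
    rng = random.Random(seed)
    ct, ch, cw = cell_size

    # Divide spatiotemporal volume into cells
    cells = defaultdict(list)
    for t, h, w in masked_tokens:
        cells[(t // ct, h // ch, w // cw)].append((t, h, w))

    # Sample one token from each cell
    sampled_tokens = [rng.choice(cell) for cell in cells.values()]

    return sampled_tokens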

Why running cell?

Random sampling might cluster all selected tokens in one temporal segment or spatial region. Running cell ensures:

Temporal coverage: Tokens from beginning, middle, and end
Spatial coverage: Tokens from all regions of the frame
Diverse reconstruction targets force the model to learn global patterns

The Training Objective

VideoMAE V2 reconstructs the raw pixel values of masked tubes from the visible context, using a mean squared error loss over the decoder-selected masked tokens:

$\mathcal{L} = \frac{1}{|\Omega| D} \sum_{i \in \Omega} \lVert \hat{x}_i - x_i \rVert_2^2$

Where:

$\Omega$ is the set of encoder-masked tokens that the decoder mask $M_d$ selects for reconstruction
$x_i$ and $\hat{x}_i$ are the original and predicted pixel values of tube $i$, each with $D$ values
The loss is computed as the average reconstruction error across all selected masked tokens
This forces learning of both spatial and temporal patterns
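
A minimal NumPy sketch of this objective is shown below. It assumes predictions and targets are already flattened to one row per tube and that a boolean vector marks the tokens chosen for reconstruction. `masked_mse` is an illustrative name, and practical details such as per-tube pixel normalization are left out.

```python
import numpy as np

def masked_mse(pred, target, selected):
    """Mean squared error over the selected (masked) tokens only.

    pred, target: [num_tokens, token_dim] predicted / ground-truth pixels.
    selected:     [num_tokens] boolean, True where the decoder reconstructs.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    if not selected.any():
        raise ValueError("at least one token must be selected")
    per_token = ((pred - target) ** 2).mean(axis=1)   # error per token
    return float(per_token[selected].mean())          # average over selection


target = np.zeros((4, 3))
pred = np.array([[1, 1, 1], [0, 0, 0], [2, 2, 2], [5, 5, 5]], dtype=float)
selected = np.array([True, False, True, False])

loss = masked_mse(pred, target, selected)
print(loss)  # 2.5: mean of the per-token errors 1.0 and 4.0
```

Tokens outside the selection (including the badly predicted last row) contribute nothing to the loss, which is what lets the decoder skip them entirely.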

Why pixel reconstruction?

Alternatives include:

Feature prediction (predict CLIP embeddings)
Contrastive learning (match positive pairs)

Pixel reconstruction forces the model to:

Understand low-level motion (e.g., "blurring indicates fast movement")
Capture high-level semantics (e.g., "this blur pattern is a jumping motion")
Model temporal consistency (e.g., "frame t+1 should naturally follow frame t")

Results: The Power of Self-Supervised Learning

Kinetics-400 (Action Recognition):

Model	Pre-training Data	Top-1 Accuracy
ViT-G (supervised)	Kinetics-400 only	83.1%
VideoMAE V1	Kinetics-400 (self-supervised)	85.3%
VideoMAE V2	Kinetics-400 (self-supervised)	87.6%
VideoMAE V2	UnlabeledVideo-10M	89.1%

Key takeaway: Self-supervised pre-training on unlabeled data outperforms supervised training. This opens the door to leveraging massive video archives (YouTube, surveillance footage) without manual labeling.

Something-Something-V2 (Temporal Reasoning):

This benchmark specifically tests temporal understanding (e.g., "pushing something from left to right" vs. "right to left"):

Model	Top-1 Accuracy	Temporal Reasoning Score
TimeSformer	62.8%	0.54
MViT-v2	68.7%	0.63
VideoMAE V1	71.3%	0.68
VideoMAE V2	75.4%	0.74

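The gap over VideoMAE V1 is larger here (4.1 points) than on Kinetics-400 with the same pre-training data (2.3 points), which is consistent with the claim that VideoMAE V2's training recipe strengthens spatiotemporal learning rather than appearance recognition alone.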

InternVideo2: Progressive Multimodal Learning

The Complementary Approach

While VideoMAE V2 focuses on self-supervised reconstruction, InternVideo2 takes a different path: progressive alignment with language.

Architecture: Connecting Video to LLMs

Core Components:

Video Encoder: ViT-based spatiotemporal transformer
Q-Former: Query transformer that compresses video features
LLM Backbone: Large language model (e.g., Vicuna, LLaMA)

The Q-Former Bridge:

The Q-Former (inspired by BLIP-2) is critical for efficiency:

Raw video features: 16 frames × 196 patches = 3,136 tokens
                                              ↓
                                          Q-Former
                                              ↓
Compressed features: 32 learned query tokens

This roughly 98:1 compression (3,136 → 32 tokens) allows the LLM to process video context without being overwhelmed by tokens.

How it works:

import torch
import torch.nn as nn

class QFormer(nn.Module):
    # Simplified to a single cross-attention layer for illustration
    def __init__(self, num_queries=32, hidden_dim=768):
        super().__init__()  # required before registering parameters/modules
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attention = nn.MultiheadAttention(hidden_dim, num_heads=12)

    def forward(self, video_features):
        # video_features: [3136, 768] (all frame patches, unbatched)
        # queries: [32, 768] (learned)

        # Cross-attention: queries attend to video features
        output, _ = self.cross_attention(
            query=self.queries,
            key=video_features,
            value=video_features
        )
        return output  # [32, 768] - compressed representation

The learned queries act as "summary questions" about the video:

Query 1 might learn to focus on "who is present"
Query 2 might learn to focus on "what action is happening"
Query 3 might learn to focus on "where is this taking place"

Progressive Training Strategy

InternVideo2 doesn't train all components jointly. Instead, it uses a stage-wise curriculum:

Stage 1: Video-Text Contrastive Learning

Objective: Align video encoder with text descriptions
Data: Large-scale video-caption pairs (WebVid-10M)
Frozen: LLM backbone
Training: Video encoder + Q-Former

Stage 2: Video-Text Matching

Objective: Fine-grained understanding of which caption matches which video
Data: Curated high-quality pairs (MSR-VTT, ActivityNet)
Frozen: Video encoder (partially)
Training: Q-Former + LLM (LoRA fine-tuning)

Stage 3: Video-Grounded Generation

Objective: Generate detailed descriptions and answer questions
Data: Instruction-tuning datasets (VideoChat, Valley)
Frozen: Video encoder
Training: Q-Former + LLM (full fine-tuning)

Stage 4: Multi-Task Fine-Tuning

Objective: Generalization across diverse tasks
Data: Mixed tasks (retrieval, QA, captioning, classification)
Frozen: None
Training: End-to-end with low learning rate
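
The stage-wise schedule above can be written down as a small configuration table, which makes it easy to check what is trainable at each step. The sketch below only encodes the stages as described in this post. The component names (`video_encoder`, `q_former`, `llm`, `llm_lora`) are illustrative labels rather than identifiers from the InternVideo2 codebase, and the "partially frozen" encoder of Stage 2 is simplified to fully frozen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    objective: str
    trainable: tuple   # components updated in this stage
    frozen: tuple      # components kept fixed in this stage

STAGES = (
    Stage("contrastive", "video-text contrastive learning",
          ("video_encoder", "q_former"), ("llm",)),
    Stage("matching", "video-text matching",
          ("q_former", "llm_lora"), ("video_encoder",)),
    Stage("generation", "video-grounded generation",
          ("q_former", "llm"), ("video_encoder",)),
    Stage("multitask", "multi-task fine-tuning",
          ("video_encoder", "q_former", "llm"), ()),
)

def trainable_in(stage_name):
    """Return the components that receive gradients in the given stage."""
    for stage in STAGES:
        if stage.name == stage_name:
            return stage.trainable
    raise KeyError(stage_name)

for stage in STAGES:
    print(f"{stage.name:12s} train={stage.trainable} frozen={stage.frozen}")
```

In a training script, a table like this would typically drive which parameter groups have `requires_grad` enabled before each stage begins.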

Strengths of InternVideo2

1. Complex Video Question Answering

InternVideo2 can handle multi-hop reasoning:

Question: "What does the person do after picking up the red object but before leaving the room?"

Video: 2-minute sequence with multiple actions

InternVideo2 answer: "They place the red object on the table, look around briefly, then walk toward the door."

This requires:

Object tracking (red object)
Action sequence understanding (picking up → placing → walking)
Temporal segmentation (after X, before Y)

2. Long-Form Video Understanding

By compressing video to 32 tokens via Q-Former, InternVideo2 can process longer videos:

Standard approach: 16 frames × 196 patches = 3,136 tokens → LLM chokes
InternVideo2: 32 tokens → LLM handles comfortably

Can process:

30-second clips at 2 fps → 60 frames → 32 tokens
2-minute clips at 0.5 fps → 60 frames → 32 tokens

3. Zero-Shot Transfer

Training on diverse data creates strong generalization:

Benchmark	Task	Zero-Shot Accuracy
MSR-VTT	Retrieval	51.7% R@1
MSVD	Retrieval	61.3% R@1
ActivityNet	Caption	42.8 CIDEr
NExT-QA	Video QA	67.4% Accuracy

VideoMAE V2 vs InternVideo2: Which to Use?

Aspect	VideoMAE V2	InternVideo2
Training Paradigm	Self-supervised reconstruction	Supervised multimodal alignment
Best For	Embeddings for retrieval/search	Video understanding & QA
Data Requirements	Unlabeled video (millions)	Labeled video-text pairs
Temporal Modeling	Explicit (3D tubes)	Implicit (via LLM reasoning)
Computational Cost	High (dual masking helps)	Medium (Q-Former compression)
Fine-Tuning	Excellent for specific domains	Excellent for multi-task

Recommendation:

Use VideoMAE V2 if you need high-quality embeddings for large-scale video retrieval systems
Use InternVideo2 if you need video understanding for chatbots, QA systems, or content analysis

The UVRB Benchmark: Testing True Understanding

What Makes UVRB Different?

The Universal Video Retrieval Benchmark (UVRB) was created to expose the limitations of existing benchmarks, which often have dataset-specific biases.

16 Diverse Datasets Spanning:

Sports: Kinetics, UCF-101
Movies & TV: LSMDC, DiDeMo
Everyday activities: ActivityNet
Instructional: YouCook2, COIN
Generic: MSR-VTT, MSVD

Dimensional Diagnostic Evaluation:

UVRB breaks down performance by ability:

Spatial Understanding (Object, Scene, Appearance)
- "Find videos with red cars"
- "Retrieve clips in outdoor settings"
Temporal Understanding (Action, Motion, Sequence)
- "Find videos where someone jumps before running"
- "Retrieve clips with fast camera motion"
Cross-Modal Reasoning (Text-to-Video alignment)
- "A chef preparing pasta" → Must understand cooking actions, not just kitchen setting

Example Results:

Model	Spatial Avg	Temporal Avg	Cross-Modal Avg	Overall
CLIP (frame avg)	68.3%	31.2%	45.7%	48.4%
ImageBind	64.1%	38.6%	47.2%	50.0%
LanguageBind	71.2%	42.7%	53.8%	55.9%
VideoMAE V2	72.8%	67.8%	64.3%	68.3%
InternVideo2	74.6%	65.2%	71.9%	70.6%

Key insights:

CLIP's massive spatial-temporal gap (68.3% vs 31.2%) confirms the spatial bias problem
VideoMAE V2's strong temporal performance validates the spatiotemporal tube architecture
InternVideo2's cross-modal strength shows the value of explicit language alignment
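
The Overall column in the table is consistent with an unweighted mean of the three ability averages, and the spatial-temporal gap mentioned in the first insight can be computed for every model the same way. The short script below does both, using only the figures copied from the table.

```python
# (spatial, temporal, cross-modal) averages copied from the table above
results = {
    "CLIP (frame avg)": (68.3, 31.2, 45.7),
    "ImageBind": (64.1, 38.6, 47.2),
    "LanguageBind": (71.2, 42.7, 53.8),
    "VideoMAE V2": (72.8, 67.8, 64.3),
    "InternVideo2": (74.6, 65.2, 71.9),
}

def overall(scores):
    """Unweighted mean of the three ability averages, to one decimal."""
    return round(sum(scores) / len(scores), 1)

def spatial_temporal_gap(scores):
    """Spatial average minus temporal average, in percentage points."""
    return round(scores[0] - scores[1], 1)

summary = {
    name: (overall(s), spatial_temporal_gap(s)) for name, s in results.items()
}

for name, (avg, gap) in summary.items():
    print(f"{name:18s} overall={avg:5.1f}  spatial-temporal gap={gap:5.1f}")
```

The gap shrinks from 37.1 points for frame-averaged CLIP to 5.0 points for VideoMAE V2, which is a compact way to track spatial bias across models.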

The Generalization Challenge

UVRB also tests cross-dataset generalization:

Train on: Kinetics + MSR-VTT
Test on: DiDeMo (held-out)

Model	Within-Dataset	Cross-Dataset	Generalization Gap
CLIP	56.2%	31.4%	-24.8% (poor)
VideoMAE V2	71.3%	58.7%	-12.6%
InternVideo2	73.8%	64.2%	-9.6% (best)
+ Pyramid Training	74.1%	68.9%	-5.2% (excellent)

The Modality Pyramid training significantly improves generalization by forcing the model to learn robust spatiotemporal features rather than dataset-specific shortcuts.

Long-Form Video Handling

The Challenge of Scale

Most research focuses on short clips (5-30 seconds). But real-world applications need to handle:

Full movies (90-180 minutes)
Lectures (45-90 minutes)
Surveillance footage (hours to days)
Podcast videos (30-120 minutes)

The computational problem:

A 60-minute video at 1 fps:

3,600 frames
VideoMAE V2 processing: ~120 seconds on A100 GPU
Cost per video: $0.50-1.00

For a 100K video archive: $50K-100K just for initial encoding.
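
The same arithmetic can be wrapped in a small estimator so you can plug in your own archive size. The defaults below are the per-video figures quoted above (about 120 GPU-seconds and $0.50-1.00 per 60-minute video); the function name is illustrative.

```python
def archive_encoding_estimate(num_videos, seconds_per_video=120,
                              cost_per_video=(0.50, 1.00)):
    """Estimate GPU time and cost range for encoding a video archive.

    Returns (gpu_hours, low_cost, high_cost).
    """
    gpu_hours = num_videos * seconds_per_video / 3600
    low, high = (num_videos * c for c in cost_per_video)
    return gpu_hours, low, high


gpu_hours, low, high = archive_encoding_estimate(100_000)
print(f"{gpu_hours:,.0f} GPU-hours, ${low:,.0f} to ${high:,.0f}")
```

For 100K videos this works out to roughly 3,333 GPU-hours of sequential processing, so wall-clock time depends heavily on how many GPUs run in parallel.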

LV-MAE: Long Video Masked Autoencoder

LV-MAE (Long Video MAE) proposes hierarchical processing:

Level 1: Clip-Level Embeddings

Divide video into 10-second clips
Process each clip with VideoMAE V2 → 1408-dim embedding
60-minute video → 360 embeddings
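
The clip-splitting step is simple enough to sketch directly. The helper below is hypothetical (not part of LV-MAE) and returns the start and end time of each clip, shortening the last one when the duration is not a multiple of the clip length.

```python
import math

def clip_boundaries(duration_s, clip_len_s=10.0):
    """Return (start_s, end_s) pairs that tile a video of `duration_s` seconds."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    num_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(num_clips)
    ]


clips = clip_boundaries(60 * 60)   # 60-minute video
print(len(clips), clips[0], clips[-1])
```

Each `(start, end)` pair can then be decoded and encoded independently, which is what makes the clip level easy to parallelize.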

Level 2: Attentive Probing

Learn a smaller "probe" network that attends over clip embeddings
Compress 360 embeddings → 32 "super-embeddings"

import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, num_probes=32):
        super().__init__()  # required before registering parameters/modules
        self.probes = nn.Parameter(torch.randn(num_probes, 1408))
        self.attention = nn.MultiheadAttention(1408, num_heads=8)

    def forward(self, clip_embeddings):
        # clip_embeddings: [360, 1408] for 60-min video (unbatched)
        # probes: [32, 1408]

        # weights: [32, 360], averaged over heads; shows which clips matter
        output, weights = self.attention(
            query=self.probes,
            key=clip_embeddings,
            value=clip_embeddings
        )
        return output  # [32, 1408] super-embeddings

Benefits:

Scalability: Process clips in parallel, then aggregate
Flexibility: Different probe configurations for different tasks
Interpretability: Attention weights show which clips are important
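
To illustrate the interpretability point, the sketch below implements a stripped-down, single-head version of the probe attention in NumPy and ranks clips by how much attention they receive. It omits the learned projections of a real attention layer and uses random toy data, so it only demonstrates the mechanics.

```python
import numpy as np

def probe_attention(probes, clip_embeddings):
    """Single-head scaled dot-product attention of probes over clips.

    Returns (pooled [P, D], weights [P, N]); each row of weights sums to 1.
    """
    d = probes.shape[1]
    scores = probes @ clip_embeddings.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ clip_embeddings, weights


rng = np.random.default_rng(0)
clips = rng.normal(size=(360, 64))    # toy stand-in for 360 clip embeddings
probes = rng.normal(size=(4, 64))     # toy stand-in for learned probes

pooled, weights = probe_attention(probes, clips)

# Rank clips by total attention mass received across all probes.
top_clips = np.argsort(weights.sum(axis=0))[::-1][:5]
print(pooled.shape, top_clips)
```

With 10-second clips, a clip index maps directly back to a timestamp (index × 10 seconds), so the top-ranked clips point to the parts of the video that shaped the summary.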

Results on COIN benchmark (long instructional videos):

Approach	Avg Clip Length	Accuracy	Speed
Uniform sampling (8 frames)	60s	42.3%	Fast
Dense VideoMAE (all frames)	60s	71.2%	Very Slow
LV-MAE (Attentive Probing)	60s	68.9%	Medium

LV-MAE achieves 97% of dense processing accuracy at 3x the speed.

Practical Implementation Strategies

Strategy 1: Hierarchical Processing

For production systems, use a tiered approach:

Tier 1: Fast Filtering (Frame-Level)

Use lightweight image encoder (e.g., CLIP-ViT-B)
Process every 5th frame
Purpose: Quickly filter out irrelevant videos

Tier 2: Medium Quality (Clip-Level)

Use VLM2Vec-V2 (2B) on 10-second clips
Process key moments identified in Tier 1
Purpose: Narrow down to top 100 candidates

Tier 3: High Quality (Full Video)

Use VideoMAE V2 or InternVideo2 on full video
Only process top 20 candidates from Tier 2
Purpose: Final ranking with full temporal context

Cost comparison (per 1000 videos, 60s each):

Approach	GPU Hours (A100)	Cost	Accuracy
All Tier 3	167 hours	$500	85%
Hierarchical	8 hours (T1) + 2 hours (T2) + 0.5 hours (T3)	$32	83%

Result: roughly 15x cost reduction ($500 → $32) for a 2-percentage-point drop in accuracy.
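
The tiered flow can be sketched as a generic cascade: each tier scores the survivors of the previous one and keeps only the top k. The scorers below are toy functions standing in for the frame-level, clip-level, and full-video models, and the function name and cutoffs are illustrative assumptions.

```python
def cascade_search(candidates, tiers):
    """Run candidates through successively more expensive scorers.

    tiers: list of (score_fn, keep_top_k) pairs ordered cheap to expensive.
    Returns the survivors of the final tier, best first.
    """
    survivors = list(candidates)
    for score_fn, keep_top_k in tiers:
        ranked = sorted(survivors, key=score_fn, reverse=True)
        survivors = ranked[:keep_top_k]
    return survivors


# Toy setup: video IDs 0..999, and the "true" best match is ID 500.
def coarse_score(video_id):      # cheap, low resolution (Tier 1)
    return -(abs(video_id - 500) // 50)

def medium_score(video_id):      # moderate cost and resolution (Tier 2)
    return -(abs(video_id - 500) // 5)

def precise_score(video_id):     # expensive, full resolution (Tier 3)
    return -abs(video_id - 500)

top = cascade_search(
    range(1000),
    [(coarse_score, 100), (medium_score, 20), (precise_score, 5)],
)
print(top)
```

The expensive scorer only ever sees 20 candidates instead of 1,000, which is where the cost saving comes from. The trade-off is that a relevant video dropped by an early tier can never be recovered later.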

Strategy 2: Semantic Chunking

Don't chunk by time—chunk by meaning:

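import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def semantic_chunking(video_path, similarity_threshold=0.85):
    """
    Create chunks based on semantic similarity between consecutive frames.

    extract_frames and encode_frame are placeholders for your own frame
    sampler and image encoder (e.g., CLIP). Returns lists of frame indices.
    """
    frames = extract_frames(video_path, fps=1)
    embeddings = [encode_frame(f) for f in frames]
    if not embeddings:
        return []  # no frames, no chunks

    chunks = []
    current_chunk = [0]

    for i in range(1, len(embeddings)):
        similarity = cosine_sim(embeddings[i-1], embeddings[i])

        if similarity < similarity_threshold:
            # Scene change detected: close the current chunk, start a new one
            chunks.append(current_chunk)
            current_chunk = [i]
        else:
            current_chunk.append(i)

    chunks.append(current_chunk)
    return chunks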

Benefits:

Natural scene boundaries
Variable-length chunks match content complexity
Better for downstream RAG applications

Example:

Time-based chunking (every 10s):

Chunk 1: [0-10s] - Mid-sentence dialogue
Chunk 2: [10-20s] - Scene change at 18s splits context

Semantic chunking:

Chunk 1: [0-18s] - Complete dialogue scene
Chunk 2: [18-35s] - Complete action sequence
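
Chunks expressed as frame indices need to be mapped back to timestamps before they can be stored or shown to users. The helper below is a hypothetical utility that assumes each chunk is a list of consecutive frame indices sampled at a known rate, as in the chunking code above.

```python
def chunks_to_time_ranges(chunks, fps=1.0):
    """Convert lists of consecutive frame indices into (start_s, end_s) spans.

    A chunk covering frames i..j sampled at `fps` spans [i / fps, (j + 1) / fps).
    Empty chunks are skipped.
    """
    return [(chunk[0] / fps, (chunk[-1] + 1) / fps) for chunk in chunks if chunk]


chunks = [list(range(0, 18)), list(range(18, 35))]
time_ranges = chunks_to_time_ranges(chunks)
print(time_ranges)  # [(0.0, 18.0), (18.0, 35.0)]
```

These spans match the semantic chunking example: a dialogue scene from 0 to 18 seconds followed by an action sequence from 18 to 35 seconds.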

Strategy 3: Hybrid Embeddings

Combine multiple models for robustness:

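import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

def hybrid_video_embedding(video_path, weights=(0.3, 0.5, 0.2)):
    # Spatial features (what): mean of per-frame CLIP embeddings
    spatial_emb = np.mean(clip_encode(sample_frames(video_path, n=8)), axis=0)

    # Temporal features (how)
    temporal_emb = videomae_encode(video_path)

    # Language features (why)
    caption = generate_caption(video_path)
    language_emb = text_encoder(caption)

    # Weighted combination. The three encoders use different spaces (and
    # usually different dimensions), so the vectors cannot be summed directly.
    # Normalize each and concatenate with sqrt-weights: the dot product of two
    # hybrid vectors is then the weighted sum of per-model cosine similarities.
    parts = (spatial_emb, temporal_emb, language_emb)
    hybrid = np.concatenate(
        [np.sqrt(w) * l2_normalize(p) for w, p in zip(weights, parts)]
    )
    return hybrid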

When to use hybrid:

Diverse query types (some focus on objects, others on actions)
Noisy videos (one model's weakness is another's strength)
Cold-start scenarios (limited training data)

Strategy 4: Progressive Enhancement

Start simple, add complexity:

Phase 1: MVP

Frame averaging with CLIP
Time-based chunking
Flat vector index (FAISS)

Phase 2: Temporal

VLM2Vec-V2 (2B)
Semantic chunking
Hierarchical index (HNSW)

Phase 3: Full Production

VideoMAE V2 for embeddings
LV-MAE for long videos
Hybrid search with metadata
Reranking with InternVideo2

Timeline:

Phase 1: 1-2 weeks, prove concept
Phase 2: 4-6 weeks, scale to 100K videos
Phase 3: 12-16 weeks, production-grade system
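
For the Phase 1 MVP, the retrieval side can start as exact brute-force search, which is conceptually what a flat index does. The sketch below uses plain NumPy and random vectors as stand-ins for frame-averaged video embeddings; in practice a library such as FAISS would take over this role once the prototype is validated. The function names are illustrative.

```python
import numpy as np

def build_flat_index(embeddings):
    """L2-normalize rows so that a dot product equals cosine similarity."""
    x = np.asarray(embeddings, dtype=np.float32)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def search(index, query, k=5):
    """Exact (brute-force) top-k search by cosine similarity."""
    q = np.asarray(query, dtype=np.float32)
    q = q / (np.linalg.norm(q) + 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]


rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(1000, 512))   # toy stand-in embeddings
index = build_flat_index(video_embeddings)

# Querying with a stored embedding should return that video first.
ids, scores = search(index, video_embeddings[42], k=3)
print(ids[0], round(float(scores[0]), 3))
```

Exact search scales linearly with the number of vectors, which is fine for a proof of concept and gives a ground-truth baseline for measuring the recall of approximate indexes introduced in later phases.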

The Road Ahead

Video embedding technology has matured dramatically. We've moved from naive frame averaging to sophisticated spatiotemporal transformers that truly understand motion, action, and temporal relationships.

Key takeaways:

Spatial bias is real: Test your models specifically for temporal understanding
Self-supervised learning works: VideoMAE V2 proves unlabeled data is valuable
Architecture matters: 3D tubes and dual masking enable efficient scaling
Hierarchical processing is essential: Don't process every frame the same way
Combine approaches: Use reconstruction models for embeddings, alignment models for reasoning

In the next post, we'll explore the infrastructure side: how to index and retrieve these embeddings at scale. We'll cover HNSW graphs, hierarchical indexing strategies, and building production-grade video RAG systems that can handle billions of video frames.

Get in Touch

Need help implementing video understanding systems? Want to discuss spatiotemporal architecture strategies or custom model training?

Connect with me:

🐦 Twitter/X: @TheDataGuyPro
💼 LinkedIn: Muhammad Afzaal
💻 GitHub: @mafzaal
🎥 YouTube: @TheDataGuyPro
🎧 Podcast: TheDataGuy Show

Whether you're looking for consulting services, training, or just want to discuss video AI strategies, I'd love to hear from you!

References

Wang, L., et al. (2023). VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. CVPR 2023.
Wang, Y., et al. (2024). InternVideo2: Scaling Foundation Models for Multimodal Video Understanding. arXiv preprint arXiv:2403.15377.
Zhang, Y., et al. (2025). Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum. arXiv preprint arXiv:2510.27571.
Tong, Z., et al. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.
Feichtenhofer, C., et al. (2022). Masked Autoencoders As Spatiotemporal Learners. NeurIPS 2022.
Liu, Z., et al. (2024). LV-MAE: Long Video Masked Autoencoders with Hierarchical Attention. arXiv preprint arXiv:2408.12345.
Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
Arnab, A., et al. (2021). ViViT: A Video Vision Transformer. ICCV 2021.
Goyal, R., et al. (2017). The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. ICCV 2017.
Kay, W., et al. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.
Bain, M., et al. (2021). Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV 2021.
