SOTAVerified

Video Retrieval

Given a text query and a pool of candidate videos, video retrieval selects the video that corresponds to the query. Systems typically return the candidates as a ranked list, which is scored with document retrieval metrics such as Recall@K (the R@1, R@5, and R@10 figures in the benchmark tables).
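As a rough illustration of how a ranked list can be scored, the sketch below computes Recall@K, the metric family behind the R@1, R@5, and R@10 columns in the benchmark tables on this page. It is a minimal pure-Python example, not the evaluation code of any listed paper. The function names `rank_videos` and `recall_at_k`, the toy similarity matrix, and the assumption that query i's correct video is candidate i are all illustrative choices that do not come from the page.

```python
from typing import List, Sequence


def rank_videos(scores: Sequence[float]) -> List[int]:
    """Return candidate video indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose correct video appears in the top k results.

    Assumption (hypothetical layout): the correct video for query i is
    candidate i, i.e. queries and videos are paired one-to-one.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        top_k = rank_videos(scores)[:k]
        if query_idx in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix: rows are text queries, columns are candidate videos.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video (index 0) is ranked 1st
    [0.8, 0.7, 0.2],  # query 1: correct video (index 1) is ranked 2nd
    [0.6, 0.5, 0.4],  # query 2: correct video (index 2) is ranked 3rd
]

r_at_1 = recall_at_k(similarity, 1)
r_at_2 = recall_at_k(similarity, 2)
r_at_3 = recall_at_k(similarity, 3)
print(f"R@1={r_at_1:.1f}  R@2={r_at_2:.1f}  R@3={r_at_3:.1f}")
```

In this toy setup, recall can only rise as K grows, which is why R@10 figures are not directly comparable with R@1 figures in the same table.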

Papers

Showing 1-50 of 486 papers

Title | Status | Hype
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4
VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Code | 3
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3
Composed Multi-modal Retrieval: A Survey of Approaches and Applications | Code | 2
Gramian Multimodal Representation Learning and Alignment | Code | 2
Explore the Limits of Omni-modal Pretraining at Scale | Code | 2
Composed Video Retrieval via Enriched Context and Discriminative Embeddings | Code | 2
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World | Code | 2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2
Multi-granularity Correspondence Learning from Long-term Noisy Videos | Code | 2
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | Code | 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Code | 2
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Code | 2
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Code | 2
Revealing Single Frame Bias for Video-and-Language Learning | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Video-GPT via Next Clip Diffusion | Code | 1
StableFusion: Continual Video Retrieval via Frame Adaptation | Code | 1
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Code | 1
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval | Code | 1
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval | Code | 1
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval | Code | 1
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Code | 1
Referring Atomic Video Action Recognition | Code | 1
GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval | Code | 1
Text-Video Retrieval with Global-Local Semantic Consistent Learning | Code | 1
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval | Code | 1
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | Code | 1
Holistic Features are almost Sufficient for Text-to-Video Retrieval | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval | Code | 1
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Code | 1
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Code | 1
VideoCon: Robust Video-Language Alignment via Contrast Captions | Code | 1
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Code | 1
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval | Code | 1
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Code | 1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Code | 1
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | Code | 1
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Code | 1
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval | Code | 1
CoVR-2: Automatic Data Construction for Composed Video Retrieval | Code | 1
Simple Baselines for Interactive Video Retrieval with Questions and Answers | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec | text-to-video R@10 | 89.4 | | Unverified
2 | CLIP4Clip | text-to-video R@10 | 81.6 | | Unverified
3 | OmniVec (pretrained) | text-to-video R@10 | 78.6 | | Unverified
4 | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 | | Unverified
5 | CLIP-ViP | text-to-video R@1 | 57.7 | | Unverified
6 | PIDRo | text-to-video R@1 | 55.9 | | Unverified
7 | DMAE (ViT-B/16) | text-to-video R@1 | 55.5 | | Unverified
8 | HunYuan_tvr | text-to-video R@1 | 55 | | Unverified
9 | MuLTI | text-to-video R@1 | 54.7 | | Unverified
10 | EERCF | text-to-video R@1 | 54.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Aurora (ours, r=64) | text-to-video R@5 | 77.4 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 74.2 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 72.3 | | Unverified
4 | VAST | text-to-video R@1 | 72 | | Unverified
5 | COSA | text-to-video R@1 | 70.5 | | Unverified
6 | UMT-L (ViT-L/16) | text-to-video R@1 | 70.4 | | Unverified
7 | GRAM | text-to-video R@1 | 67.3 | | Unverified
8 | VALOR | text-to-video R@1 | 61.5 | | Unverified
9 | TESTA (ViT-B/16) | text-to-video R@1 | 61.2 | | Unverified
10 | VindLU | text-to-video R@1 | 61.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GRAM | text-to-video R@1 | 64 | | Unverified
2 | VAST | text-to-video R@1 | 63.9 | | Unverified
3 | InternVideo2-6B | text-to-video R@1 | 62.8 | | Unverified
4 | VALOR | text-to-video R@1 | 59.9 | | Unverified
5 | UMT-L (ViT-L/16) | text-to-video R@1 | 58.8 | | Unverified
6 | vid-TLDR (UMT-L) | text-to-video R@1 | 58.1 | | Unverified
7 | COSA | text-to-video R@1 | 57.9 | | Unverified
8 | InternVideo2-6B | text-to-video R@1 | 55.9 | | Unverified
9 | InternVideo | text-to-video R@1 | 55.2 | | Unverified
10 | VLAB | text-to-video R@1 | 55.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | text-to-video R@10 | 53.7 | | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 46.4 | | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 43.1 | | Unverified
4 | UMT-L (ViT-L/16) | text-to-video R@1 | 43 | | Unverified
5 | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 | | Unverified
6 | COSA | text-to-video R@1 | 39.4 | | Unverified
7 | mPLUG-2 | text-to-video R@1 | 34.4 | | Unverified
8 | VALOR | text-to-video R@1 | 34.2 | | Unverified
9 | InternVideo | text-to-video R@1 | 34 | | Unverified
10 | InternVideo2-6B | text-to-video R@1 | 33.8 | | Unverified