
Zero-Shot Video Retrieval

Zero-shot video retrieval is the task of retrieving relevant videos for a query (usually text) without any task-specific training on examples of those videos. Unlike traditional retrieval methods that rely on supervised learning over annotated datasets, zero-shot retrieval leverages models pre-trained with large-scale vision-language learning to capture semantic relationships between textual descriptions and video content.

This approach enables retrieval of unseen video concepts by generalizing knowledge from diverse training data, making it highly useful for domains with limited labeled data, such as broadcast media, surveillance, and historical archives.
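In practice, this usually means embedding the text query and the candidate videos into a shared space with a pre-trained model and ranking by similarity. Below is a minimal sketch using CLIP as a stand-in backbone (the leaderboard models use stronger video-native encoders); the mean-pooled frame representation and the `openai/clip-vit-base-patch32` checkpoint are illustrative assumptions, not details taken from any paper listed here.

```python
# Minimal zero-shot text-to-video retrieval sketch: embed frames and text
# with a pre-trained image-text model, mean-pool frames into a video vector,
# and rank videos by cosine similarity to the query. Illustrative only.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frames):
    """Encode a list of PIL frames and mean-pool into one video embedding."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_emb = model.get_image_features(**inputs)    # (num_frames, dim)
    video_emb = frame_emb.mean(dim=0, keepdim=True)       # (1, dim)
    return video_emb / video_emb.norm(dim=-1, keepdim=True)

def rank_videos(query, video_embs):
    """Rank pre-computed video embeddings against a free-text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (text_emb @ torch.cat(video_embs).T).squeeze(0)  # cosine similarity
    return sims.argsort(descending=True)                    # best match first
```

No retrieval-specific fine-tuning happens anywhere in this pipeline, which is what makes the setup "zero-shot": the ranking quality comes entirely from the pre-trained model's text-image alignment.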

Papers

Showing 1–10 of 40 papers

| Title | Status | Hype |
|---|---|---|
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | Code | 1 |
| Gramian Multimodal Representation Learning and Alignment | Code | 2 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2 |
| OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | — | 0 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Code | 1 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Code | 1 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 70.4 | — | Unverified |
| 2 | VideoCoCa | text-to-video R@1 | 53.2 | — | Unverified |
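For reference, text-to-video R@1 (recall at rank 1) is the percentage of text queries whose ground-truth video is ranked first. A minimal sketch of the computation, assuming a query-by-video similarity matrix where query i's correct video sits at index i (a common evaluation convention, not something stated in the table above):

```python
import torch

def recall_at_1(sims: torch.Tensor) -> float:
    """Text-to-video R@1 from a (num_queries, num_videos) similarity matrix,
    assuming query i's ground-truth video is at column i."""
    top1 = sims.argmax(dim=1)                        # best-ranked video per query
    correct = top1 == torch.arange(sims.size(0))     # hit when ground truth wins
    return 100.0 * correct.float().mean().item()     # reported as a percentage
```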