SOTAVerified

Zero-Shot Video Retrieval

Zero-shot video retrieval is the task of retrieving relevant videos based on a query (usually in text form) without any prior training on specific examples of those videos. Unlike traditional retrieval methods that rely on supervised learning with annotated datasets, zero-shot retrieval leverages pre-trained models, typically based on large-scale vision-language learning, to understand semantic relationships between textual descriptions and video content.

This approach enables retrieval of unseen video concepts by generalizing knowledge from diverse training data, making it highly useful for domains with limited labeled data, such as broadcast media, surveillance, and historical archives.
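The retrieval step described above can be sketched as a nearest-neighbour search in a shared embedding space. This is a minimal illustration, not the method of any listed paper: the embeddings are hand-written toy vectors standing in for the outputs of a pre-trained text encoder and video encoder, and the names `cosine_similarity`, `rank_videos`, and the clip ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the query."""
    scores = {
        video_id: cosine_similarity(query_embedding, embedding)
        for video_id, embedding in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values); a real system would obtain these from
# a pre-trained vision-language model without fine-tuning on the target data.
video_embeddings = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "soccer_clip": [0.1, 0.9, 0.2],
    "news_clip": [0.0, 0.3, 0.9],
}
# Stand-in for the encoded text query "a person chopping vegetables".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_videos(query_embedding, video_embeddings)
print(ranking)
```

Because the query and the videos are compared only through their embeddings, no labelled query-video pairs from the target collection are needed at retrieval time.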

Papers

Showing 1-10 of 40 papers

| Title | Status | Hype |
| --- | --- | --- |
| Make Your Training Flexible: Towards Deployment-Efficient Video Models | Code | 1 |
| Gramian Multimodal Representation Learning and Alignment | Code | 2 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2 |
| OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | | 0 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1 |
| HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Code | 1 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Code | 1 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2 |

Benchmark Results

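| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 51.9 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 50 | | Unverified |
| 3 | FluxViT-B | text-to-video R@1 | 49.9 | | Unverified |
| 4 | mPLUG-2 | text-to-video R@1 | 47.1 | | Unverified |
| 5 | FluxViT-S | text-to-video R@1 | 45 | | Unverified |
| 6 | LanguageBind(ViT-H/14) | text-to-video R@1 | 44.8 | | Unverified |
| 7 | LanguageBind(ViT-L/14) | text-to-video R@1 | 42.8 | | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 40.9 | | Unverified |
| 9 | HowToCaption | text-to-video R@1 | 37.6 | | Unverified |
| 10 | Florence | text-to-video R@1 | 37.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 57 | | Unverified |
| 2 | HiTeA-17M | text-to-video R@1 | 43.2 | | Unverified |
| 3 | LanguageBind(ViT-H/14) | text-to-video R@1 | 39.9 | | Unverified |
| 4 | LanguageBind(ViT-L/14) | text-to-video R@1 | 39.7 | | Unverified |
| 5 | Singularity-17M | text-to-video R@1 | 37.1 | | Unverified |
| 6 | Singularity-5M | text-to-video R@1 | 36.9 | | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 36.1 | | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 35.6 | | Unverified |
| 9 | MILES | text-to-video R@1 | 27.2 | | Unverified |
| 10 | Y. Ge et al. | text-to-video R@1 | 25.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 32 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 27.7 | | Unverified |
| 3 | BT-Adapter | text-to-video R@1 | 19.5 | | Unverified |
| 4 | HiTeA-17M | text-to-video R@1 | 18.3 | | Unverified |
| 5 | HowToCaption | text-to-video R@1 | 17.3 | | Unverified |
| 6 | Yatai Ji et al. | text-to-video R@1 | 17.2 | | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 15.5 | | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 12.2 | | Unverified |
| 9 | MILES | text-to-video R@1 | 11.1 | | Unverified |
| 10 | SSML | text-to-video R@1 | 4.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 58.1 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 54.8 | | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 54.1 | | Unverified |
| 4 | LanguageBind(ViT-H/14) | text-to-video R@1 | 53.9 | | Unverified |
| 5 | UMT-L (ViT-L/16) | text-to-video R@1 | 49 | | Unverified |
| 6 | HowToCaption | text-to-video R@1 | 44.5 | | Unverified |
| 7 | MILES | text-to-video R@1 | 44.4 | | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 43.6 | | Unverified |
| 9 | LaT | text-to-video R@1 | 36.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 60.4 | | Unverified |
| 2 | LanguageBind(ViT-H/14) | text-to-video R@1 | 41 | | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 38.4 | | Unverified |
| 4 | BT-Adapter | text-to-video R@1 | 37 | | Unverified |
| 5 | VideoCoCa | text-to-video R@1 | 34.5 | | Unverified |
| 6 | Singularity-temporal-5M | text-to-video R@1 | 30.8 | | Unverified |
| 7 | Singularity-temporal-17M | text-to-video R@1 | 30.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VATT-MBS | text-to-video R@10 | 45.5 | | Unverified |
| 2 | OmniVec2 | text-to-video R@1 | 26.1 | | Unverified |
| 3 | Norton | text-to-video R@1 | 24.2 | | Unverified |
| 4 | VideoCoCa | text-to-video R@1 | 20.3 | | Unverified |
| 5 | VAST, HowToCaption-finetuned | text-to-video R@1 | 19.7 | | Unverified |
| 6 | MIL-NCE | text-to-video R@1 | 15.1 | | Unverified |
| 7 | HowToCaption | text-to-video R@1 | 13.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVL-G | text-to-video R@1 | 46.3 | | Unverified |
| 2 | InternVL-C | text-to-video R@1 | 44.7 | | Unverified |
| 3 | VideoCoCa | text-to-video R@1 | 34.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVideo2-1B | text-to-video R@1 | 70.4 | | Unverified |
| 2 | VideoCoCa | text-to-video R@1 | 53.2 | | Unverified |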
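
The metric in the tables above, text-to-video R@K (Recall at K), is the share of text queries for which the correct video appears among the top K retrieved results; the claimed scores appear to be percentages. The sketch below shows one common way to compute it from a query-by-video similarity matrix. It assumes that query i is paired with video i, which is an illustrative convention rather than something this page states, and the function name `recall_at_k` and the scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth video is ranked in the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the matching video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix: 4 text queries (rows) against 4 videos (columns).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct video ranked 1st
    [0.6, 0.5, 0.2, 0.1],  # query 1: correct video ranked 2nd
    [0.1, 0.2, 0.8, 0.3],  # query 2: correct video ranked 1st
    [0.7, 0.6, 0.5, 0.4],  # query 3: correct video ranked 4th
]

r_at_1 = 100 * recall_at_k(similarity, 1)
r_at_2 = 100 * recall_at_k(similarity, 2)
print(f"R@1 = {r_at_1:.1f}, R@2 = {r_at_2:.1f}")  # R@1 = 50.0, R@2 = 75.0
```

R@1 is the strictest setting, since only a first-place match counts; R@10 credits any match in the first ten results, so R@1 and R@10 scores are not directly comparable.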