SOTAVerified

Zero-Shot Video Retrieval

Zero-shot video retrieval is the task of retrieving videos relevant to a query (usually text) with a model that has not been trained or fine-tuned on examples from the target video collection. Unlike traditional retrieval methods that rely on supervised learning with annotated datasets, zero-shot retrieval uses pre-trained models, typically built with large-scale vision-language learning, to relate textual descriptions to video content semantically.

This approach enables retrieval of unseen video concepts by generalizing knowledge from diverse training data, making it highly useful for domains with limited labeled data, such as broadcast media, surveillance, and historical archives.
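As a rough illustration of the idea described above, the sketch below ranks videos by cosine similarity between a text-query embedding and per-video embeddings. It assumes a pre-trained model has already mapped text and video into one shared embedding space; the vectors, clip names, and the helper `rank_videos` are hypothetical placeholders, not the output or API of any particular model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the query."""
    scores = {
        video_id: cosine_similarity(query_embedding, embedding)
        for video_id, embedding in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings. A real system would obtain these from the
# text and video encoders of a pre-trained vision-language model.
video_embeddings = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "soccer_clip": [0.1, 0.9, 0.2],
    "news_clip": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a query such as "a person chopping vegetables".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_videos(query_embedding, video_embeddings)
print(ranking)
```

Nothing in this sketch is fitted to the video collection being searched, which is what makes the setting zero-shot: all task knowledge lives in the pre-trained encoders that would produce the embeddings.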

Papers

Showing 11–20 of 40 papers

| Title | Status | Hype |
|---|---|---|
| ImageBind: One Embedding Space To Bind Them All | Code | 5 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code | 0 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | | 0 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 51.9 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 50 | | Unverified |
| 3 | FluxViT-B | text-to-video R@1 | 49.9 | | Unverified |
| 4 | mPLUG-2 | text-to-video R@1 | 47.1 | | Unverified |
| 5 | FluxViT-S | text-to-video R@1 | 45 | | Unverified |
| 6 | LanguageBind(ViT-H/14) | text-to-video R@1 | 44.8 | | Unverified |
| 7 | LanguageBind(ViT-L/14) | text-to-video R@1 | 42.8 | | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 40.9 | | Unverified |
| 9 | Florence | text-to-video R@1 | 37.6 | | Unverified |
| 10 | HowToCaption | text-to-video R@1 | 37.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 57 | | Unverified |
| 2 | HiTeA-17M | text-to-video R@1 | 43.2 | | Unverified |
| 3 | LanguageBind(ViT-H/14) | text-to-video R@1 | 39.9 | | Unverified |
| 4 | LanguageBind(ViT-L/14) | text-to-video R@1 | 39.7 | | Unverified |
| 5 | Singularity-17M | text-to-video R@1 | 37.1 | | Unverified |
| 6 | Singularity-5M | text-to-video R@1 | 36.9 | | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 36.1 | | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 35.6 | | Unverified |
| 9 | MILES | text-to-video R@1 | 27.2 | | Unverified |
| 10 | Y. Ge et al. | text-to-video R@1 | 25.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 32 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 27.7 | | Unverified |
| 3 | BT-Adapter | text-to-video R@1 | 19.5 | | Unverified |
| 4 | HiTeA-17M | text-to-video R@1 | 18.3 | | Unverified |
| 5 | HowToCaption | text-to-video R@1 | 17.3 | | Unverified |
| 6 | Yatai Ji et al. | text-to-video R@1 | 17.2 | | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 15.5 | | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 12.2 | | Unverified |
| 9 | MILES | text-to-video R@1 | 11.1 | | Unverified |
| 10 | SSML | text-to-video R@1 | 4.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 58.1 | | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 54.8 | | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 54.1 | | Unverified |
| 4 | LanguageBind(ViT-H/14) | text-to-video R@1 | 53.9 | | Unverified |
| 5 | UMT-L (ViT-L/16) | text-to-video R@1 | 49 | | Unverified |
| 6 | HowToCaption | text-to-video R@1 | 44.5 | | Unverified |
| 7 | MILES | text-to-video R@1 | 44.4 | | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 43.6 | | Unverified |
| 9 | LaT | text-to-video R@1 | 36.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 60.4 | | Unverified |
| 2 | LanguageBind(ViT-H/14) | text-to-video R@1 | 41 | | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 38.4 | | Unverified |
| 4 | BT-Adapter | text-to-video R@1 | 37 | | Unverified |
| 5 | VideoCoCa | text-to-video R@1 | 34.5 | | Unverified |
| 6 | Singularity-temporal-5M | text-to-video R@1 | 30.8 | | Unverified |
| 7 | Singularity-temporal-17M | text-to-video R@1 | 30.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VATT-MBS | text-to-video R@10 | 45.5 | | Unverified |
| 2 | OmniVec2 | text-to-video R@1 | 26.1 | | Unverified |
| 3 | Norton | text-to-video R@1 | 24.2 | | Unverified |
| 4 | VideoCoCa | text-to-video R@1 | 20.3 | | Unverified |
| 5 | VAST, HowToCaption-finetuned | text-to-video R@1 | 19.7 | | Unverified |
| 6 | MIL-NCE | text-to-video R@1 | 15.1 | | Unverified |
| 7 | HowToCaption | text-to-video R@1 | 13.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVL-G | text-to-video R@1 | 46.3 | | Unverified |
| 2 | InternVL-C | text-to-video R@1 | 44.7 | | Unverified |
| 3 | VideoCoCa | text-to-video R@1 | 34.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 70.4 | | Unverified |
| 2 | VideoCoCa | text-to-video R@1 | 53.2 | | Unverified |
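
The tables above report text-to-video recall at rank K (R@K). As a minimal sketch of how such a figure is commonly computed, the code below assumes a square text-to-video similarity matrix in which query i's ground-truth video is video i, and returns the result as a percentage. The function name `recall_at_k` and the toy scores are illustrative assumptions; the evaluation protocol behind any specific leaderboard entry may differ.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video is in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Video indices ordered from highest to lowest score for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical toy scores for three queries and three videos.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked 1st
    [0.2, 0.4, 0.8],  # query 1: correct video is ranked 2nd
    [0.7, 0.6, 0.5],  # query 2: correct video is ranked 3rd
]

r1 = recall_at_k(similarity, 1)
r2 = recall_at_k(similarity, 2)
r3 = recall_at_k(similarity, 3)
print(round(r1, 1), round(r2, 1), round(r3, 1))
```

Because R@K only checks whether the correct video appears among the top K results, it can never decrease as K grows, so an R@10 entry is not directly comparable with the R@1 entries listed beside it.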