Zero-Shot Video Retrieval
Zero-shot video retrieval is the task of retrieving videos relevant to a query (usually text) without training on examples from the target video collection. Unlike traditional retrieval methods, which rely on supervised learning over annotated datasets, zero-shot retrieval uses pre-trained models, typically built with large-scale vision-language learning, to relate textual descriptions to video content.
Because such a model generalizes from diverse pre-training data, it can retrieve video concepts it was never explicitly trained on, which makes the approach useful in domains with little labeled data, such as broadcast media, surveillance, and historical archives.
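
As a rough illustration of the retrieval step described above, the sketch below ranks videos by cosine similarity between a text-query embedding and video embeddings. The embeddings are hand-written toy vectors standing in for the outputs of a pre-trained vision-language encoder, and the helper names `l2_normalize` and `rank_videos` are hypothetical, not taken from any of the models listed on this page.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices ordered from most to least similar to the query."""
    scores = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-scores)


# Toy embeddings (assumption): in practice these would come from the text and
# video towers of a pre-trained model, with no fine-tuning on the target data.
video_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_emb = np.array([0.9, 0.1, 0.0])

ranking = rank_videos(query_emb, video_embs)
print(ranking.tolist())  # [0, 2, 1]
```

Nothing in this step depends on the target dataset, which is what makes the setting zero-shot: only the frozen encoders and a similarity function are involved.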
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 51.9 | — | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 50 | — | Unverified |
| 3 | FluxViT-B | text-to-video R@1 | 49.9 | — | Unverified |
| 4 | mPLUG-2 | text-to-video R@1 | 47.1 | — | Unverified |
| 5 | FluxViT-S | text-to-video R@1 | 45 | — | Unverified |
| 6 | LanguageBind(ViT-H/14) | text-to-video R@1 | 44.8 | — | Unverified |
| 7 | LanguageBind(ViT-L/14) | text-to-video R@1 | 42.8 | — | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 40.9 | — | Unverified |
| 9 | HowToCaption | text-to-video R@1 | 37.6 | — | Unverified |
| 10 | Florence | text-to-video R@1 | 37.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 57 | — | Unverified |
| 2 | HiTeA-17M | text-to-video R@1 | 43.2 | — | Unverified |
| 3 | LanguageBind(ViT-H/14) | text-to-video R@1 | 39.9 | — | Unverified |
| 4 | LanguageBind(ViT-L/14) | text-to-video R@1 | 39.7 | — | Unverified |
| 5 | Singularity-17M | text-to-video R@1 | 37.1 | — | Unverified |
| 6 | Singularity-5M | text-to-video R@1 | 36.9 | — | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 36.1 | — | Unverified |
| 8 | BT-Adapter | text-to-video R@1 | 35.6 | — | Unverified |
| 9 | MILES | text-to-video R@1 | 27.2 | — | Unverified |
| 10 | Y. Ge et al. | text-to-video R@1 | 25.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 32 | — | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 27.7 | — | Unverified |
| 3 | BT-Adapter | text-to-video R@1 | 19.5 | — | Unverified |
| 4 | HiTeA-17M | text-to-video R@1 | 18.3 | — | Unverified |
| 5 | HowToCaption | text-to-video R@1 | 17.3 | — | Unverified |
| 6 | Yatai Ji et al. | text-to-video R@1 | 17.2 | — | Unverified |
| 7 | HiTeA-5M | text-to-video R@1 | 15.5 | — | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 12.2 | — | Unverified |
| 9 | MILES | text-to-video R@1 | 11.1 | — | Unverified |
| 10 | SSML | text-to-video R@1 | 4.2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 58.1 | — | Unverified |
| 2 | VAST, HowToCaption-finetuned | text-to-video R@1 | 54.8 | — | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 54.1 | — | Unverified |
| 4 | LanguageBind(ViT-H/14) | text-to-video R@1 | 53.9 | — | Unverified |
| 5 | UMT-L (ViT-L/16) | text-to-video R@1 | 49 | — | Unverified |
| 6 | HowToCaption | text-to-video R@1 | 44.5 | — | Unverified |
| 7 | MILES | text-to-video R@1 | 44.4 | — | Unverified |
| 8 | Y. Ge et al. | text-to-video R@1 | 43.6 | — | Unverified |
| 9 | LaT | text-to-video R@1 | 36.9 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 60.4 | — | Unverified |
| 2 | LanguageBind(ViT-H/14) | text-to-video R@1 | 41 | — | Unverified |
| 3 | LanguageBind(ViT-L/14) | text-to-video R@1 | 38.4 | — | Unverified |
| 4 | BT-Adapter | text-to-video R@1 | 37 | — | Unverified |
| 5 | VideoCoCa | text-to-video R@1 | 34.5 | — | Unverified |
| 6 | Singularity-temporal-5M | text-to-video R@1 | 30.8 | — | Unverified |
| 7 | Singularity-temporal-17M | text-to-video R@1 | 30.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VATT-MBS | text-to-video R@10 | 45.5 | — | Unverified |
| 2 | OmniVec2 | text-to-video R@1 | 26.1 | — | Unverified |
| 3 | Norton | text-to-video R@1 | 24.2 | — | Unverified |
| 4 | VideoCoCa | text-to-video R@1 | 20.3 | — | Unverified |
| 5 | VAST, HowToCaption-finetuned | text-to-video R@1 | 19.7 | — | Unverified |
| 6 | MIL-NCE | text-to-video R@1 | 15.1 | — | Unverified |
| 7 | HowToCaption | text-to-video R@1 | 13.4 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVL-G | text-to-video R@1 | 46.3 | — | Unverified |
| 2 | InternVL-C | text-to-video R@1 | 44.7 | — | Unverified |
| 3 | VideoCoCa | text-to-video R@1 | 34.3 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-1B | text-to-video R@1 | 70.4 | — | Unverified |
| 2 | VideoCoCa | text-to-video R@1 | 53.2 | — | Unverified |
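
The R@1 and R@10 figures in the tables above are recall-at-K scores: the percentage of text queries for which the correct video appears among the top K retrieved results. The sketch below shows one common way to compute this from a query-by-video similarity matrix. It assumes a one-to-one pairing in which query i matches video i, and it breaks ties in favor of the ground truth. Individual benchmarks may use different pairings or protocols, and the function name `recall_at_k` is illustrative only.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries whose ground-truth video ranks in the top k.

    sim[i, j] is the similarity of text query i to video j. The ground-truth
    video for query i is assumed to be video i (one-to-one pairing).
    """
    gt_scores = np.diag(sim)[:, None]
    # Zero-based rank of the ground truth: how many videos score strictly higher.
    ranks = (sim > gt_scores).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


# Toy similarity matrix (assumption): 3 queries scored against 3 videos.
sim = np.array([
    [0.9, 0.1, 0.3],  # query 0: ground truth (0.9) is ranked first
    [0.2, 0.4, 0.8],  # query 1: ground truth (0.4) is ranked second
    [0.5, 0.6, 0.7],  # query 2: ground truth (0.7) is ranked first
])

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(round(r1, 1), round(r2, 1))  # 66.7 100.0
```

Because R@10 counts a hit anywhere in the top ten, it is always at least as high as R@1 for the same model and dataset, so entries reported under different K are not directly comparable.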