SOTAVerified

Video Retrieval

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as recall at rank K (R@K).
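As a rough illustration of how a ranked list is scored, the sketch below computes recall at rank K, the metric family (R@1, R@5, R@10) reported in the benchmark tables on this page. The page itself does not define the metric, so this follows the common definition: the percentage of text queries whose correct video appears among the top K candidates. The similarity matrix, the function names `rank_of_target` and `recall_at_k`, and the convention that query i matches video i are all assumptions made for the example.

```python
from typing import Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """Return the 1-based rank of the target video when candidates
    are sorted by descending similarity score."""
    target_score = scores[target_index]
    # Count candidates that score strictly higher than the target.
    # Ties are resolved in the target's favor (an optimistic choice).
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy similarity matrix (hypothetical values):
# row i = text query i, column j = candidate video j.
# Assumption: the correct video for query i is video i.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.7, 0.6],
    [0.5, 0.6, 0.1, 0.4],
]

ranks = [rank_of_target(row, i) for i, row in enumerate(similarity)]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(ranks, k):.1f}")
```

With real models, the similarity matrix would come from comparing text and video embeddings; the scoring step stays the same. Evaluation code in individual papers may break ties or handle multiple captions per video differently.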

Papers

Showing 251-300 of 486 papers

Title | Status | Hype
A CLIP-Hitchhiker's Guide to Long Video Retrieval | Code | 1
Learning to Retrieve Videos by Asking Questions | Code | 0
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | Code | 1
Learn to Understand Negation in Video Retrieval | Code | 0
Relevance-based Margin for Contrastively-trained Video Retrieval Models | Code | 0
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Code | 1
A Survey of Video-based Action Quality Assessment |  | 0
Modality-Balanced Embedding for Video Retrieval |  | 0
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval |  | 0
Exploring the Temporal Cues to Enhance Video Retrieval on Standardized CDVA | Code | 0
Probabilistic Representations for Video Contrastive Learning |  | 0
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |  | 0
Temporal Alignment Networks for Long-term Video | Code | 1
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Code | 1
Learning Audio-Video Modalities from Image Captions |  | 0
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Code | 0
CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation |  | 0
Controllable Augmentations for Video Representation Learning |  | 0
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Code | 1
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Code | 0
Learning video retrieval models with relevance-aware online mining | Code | 1
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Code | 1
Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data | Code | 1
All in One: Exploring Unified Video-Language Pre-training | Code | 2
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |  | 0
Disentangled Representation Learning for Text-Video Retrieval | Code | 1
Live Laparoscopic Video Retrieval with Compressed Uncertainty |  | 0
VScript: Controllable Script Generation with Visual Presentation |  | 0
NEWSKVQA: Knowledge-Aware News Video Question Answering |  | 0
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval | Code | 1
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval | Code | 1
Self-supervised Video Representation Learning with Cascade Positive Retrieval | Code | 0
End-to-end Generative Pretraining for Multimodal Video Captioning |  | 0
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Multi-Query Video Retrieval | Code | 1
Watch Less and Uncover More: Could Navigation Tools Help Users Search and Explore Videos? |  | 0
Sign Language Video Retrieval with Free-Form Textual Queries |  | 0
Sound and Visual Representation Learning with Multiple Pretraining Tasks |  | 0
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval | Code | 1
Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Code | 1
Cross Modal Retrieval with Querybank Normalisation | Code | 1
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Code | 1
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos | Code | 0
Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity |  | 0
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval | Code | 1
Prompting Visual-Language Models for Efficient Video Understanding | Code | 1
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning |  | 0
Time-Equivariant Contrastive Video Representation Learning |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | OmniVec | text-to-video R@10 | 89.4 |  | Unverified
2 | CLIP4Clip | text-to-video R@10 | 81.6 |  | Unverified
3 | OmniVec (pretrained) | text-to-video R@10 | 78.6 |  | Unverified
4 | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 |  | Unverified
5 | CLIP-ViP | text-to-video R@1 | 57.7 |  | Unverified
6 | PIDRo | text-to-video R@1 | 55.9 |  | Unverified
7 | DMAE (ViT-B/16) | text-to-video R@1 | 55.5 |  | Unverified
8 | HunYuan_tvr | text-to-video R@1 | 55 |  | Unverified
9 | MuLTI | text-to-video R@1 | 54.7 |  | Unverified
10 | STAN | text-to-video R@1 | 54.1 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Aurora (ours, r=64) | text-to-video R@5 | 77.4 |  | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 74.2 |  | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 72.3 |  | Unverified
4 | VAST | text-to-video R@1 | 72 |  | Unverified
5 | COSA | text-to-video R@1 | 70.5 |  | Unverified
6 | UMT-L (ViT-L/16) | text-to-video R@1 | 70.4 |  | Unverified
7 | GRAM | text-to-video R@1 | 67.3 |  | Unverified
8 | VALOR | text-to-video R@1 | 61.5 |  | Unverified
9 | TESTA (ViT-B/16) | text-to-video R@1 | 61.2 |  | Unverified
10 | VindLU | text-to-video R@1 | 61.2 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GRAM | text-to-video R@1 | 64 |  | Unverified
2 | VAST | text-to-video R@1 | 63.9 |  | Unverified
3 | InternVideo2-6B | text-to-video R@1 | 62.8 |  | Unverified
4 | VALOR | text-to-video R@1 | 59.9 |  | Unverified
5 | UMT-L (ViT-L/16) | text-to-video R@1 | 58.8 |  | Unverified
6 | vid-TLDR (UMT-L) | text-to-video R@1 | 58.1 |  | Unverified
7 | COSA | text-to-video R@1 | 57.9 |  | Unverified
8 | InternVideo2-6B | text-to-video R@1 | 55.9 |  | Unverified
9 | InternVideo | text-to-video R@1 | 55.2 |  | Unverified
10 | VLAB | text-to-video R@1 | 55.1 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | text-to-video R@10 | 53.7 |  | Unverified
2 | InternVideo2-6B | text-to-video R@1 | 46.4 |  | Unverified
3 | vid-TLDR (UMT-L) | text-to-video R@1 | 43.1 |  | Unverified
4 | UMT-L (ViT-L/16) | text-to-video R@1 | 43 |  | Unverified
5 | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 |  | Unverified
6 | COSA | text-to-video R@1 | 39.4 |  | Unverified
7 | mPLUG-2 | text-to-video R@1 | 34.4 |  | Unverified
8 | VALOR | text-to-video R@1 | 34.2 |  | Unverified
9 | InternVideo | text-to-video R@1 | 34 |  | Unverified
10 | InternVideo2-6B | text-to-video R@1 | 33.8 |  | Unverified