SOTAVerified

Text to Video Retrieval

She's gone I can't find her anywhere I'm looking everywhere for her Everywhere is dark

Papers

Showing 51-75 of 75 papers

| Title | Status | Hype |
|---|---|---|
| Distilling Vision-Language Models on Millions of Videos | | 0 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | | 0 |
| E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer | | 0 |
| An Empirical Study of Frame Selection for Text-to-Video Retrieval | | 0 |
| TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval | | 0 |
| Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | | 0 |
| MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Code | 0 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Code | 0 |
| Temporal Perceiving Video-Language Pre-training | | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks | | 0 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | | 0 |
| Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Code | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | | 0 |
| Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks | | 0 |
| Robustness Analysis of Video-Language Models Against Visual and Language Perturbations | Code | 0 |
| Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Code | 0 |
| RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | Code | 0 |
| Learning to Retrieve Videos by Asking Questions | Code | 0 |
| COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 0 |
| FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Code | 0 |
| MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | | 0 |
| CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning | | 0 |
| Support-set bottlenecks for video-text representation learning | | 0 |
| Retrieving and Highlighting Action with Spatiotemporal Reference | | 0 |
| Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Code | 0 |

Benchmark Results

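| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | FROZEN-revised | mAP | 23.39 | | Unverified |
| 2 | FROZEN-revised (two-stream) | text-to-video R@1 | 12.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CLIP4Clip | text-to-video R@1 | 44.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | X-CLIP (Cross-Lingual) | R@1 | 32.3 | | Unverified |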