SOTAVerified

Text to Video Retrieval

Example query: "She's gone. I can't find her anywhere. I'm looking everywhere for her. Everywhere is dark."

Papers

Showing 1–50 of 75 papers

Title | Status | Hype
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2
Revealing Single Frame Bias for Video-and-Language Learning | Code | 2
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
Condensed Movies: Story Based Retrieval with Contextual Embeddings | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval | Code | 1
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound | Code | 1
End-to-End Learning of Visual Representations from Uncurated Instructional Videos | Code | 1
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | Code | 1
Holistic Features are almost Sufficient for Text-to-Video Retrieval | Code | 1
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Code | 1
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval | Code | 1
MDMMT: Multidomain Multimodal Transformer for Video Retrieval | Code | 1
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Code | 1
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos | Code | 1
Partially Relevant Video Retrieval | Code | 1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Code | 1
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval | Code | 1
Revisiting the "Video" in Video-Language Understanding | Code | 1
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Code | 1
StableFusion: Continual Video Retrieval via Frame Adaptation | Code | 1
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) | Code | 1
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | Code | 1
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Code | 1
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Code | 1
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Code | 1
VideoCon: Robust Video-Language Alignment via Contrast Captions | Code | 1
VindLU: A Recipe for Effective Video-and-Language Pretraining | Code | 1
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | Code | 1
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | Code | 1
Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering | Code | 0
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval | Code | 0
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations | Code | 0
Learning to Retrieve Videos by Asking Questions | Code | 0
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Code | 0
ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising | Code | 0
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | Code | 0
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Code | 0
Semantic Role Aware Correlation Transformer for Text to Video Retrieval | Code | 0
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval | Code | 0
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Code | 0
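Several of the listed papers (CLIP4Clip in particular) rank videos for a text query by mean-pooling per-frame embeddings into one video vector and comparing it to the text embedding by cosine similarity. A minimal sketch of that ranking step, using toy hand-written embeddings rather than real CLIP outputs (the function and variable names here are illustrative, not from any of the papers' code):

```python
# Toy sketch of mean-pooled text-to-video ranking (CLIP4Clip-style "meanP").
# Embeddings are hand-written 2-D vectors, not real CLIP features.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def rank_videos(text_emb, videos):
    """Return video ids sorted by cosine similarity to the text query."""
    t = l2_normalize(text_emb)
    scored = []
    for vid, frames in videos.items():
        v = l2_normalize(mean_pool(frames))
        scored.append((sum(a * b for a, b in zip(t, v)), vid))
    return [vid for _, vid in sorted(scored, reverse=True)]

# Two toy videos, three frames each.
videos = {
    "vid_a": [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]],
    "vid_b": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.1]],
}
print(rank_videos([1.0, 0.0], videos))  # query aligned with vid_a first
```

Mean pooling discards temporal order; many of the later papers in the list (X-Pool, TC-MGC, the coarse-to-fine methods) are precisely about replacing this pooling with finer-grained text-conditioned aggregation.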

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | FROZEN-revised | mAP | 23.39 | — | Unverified
2 | FROZEN-revised (two-stream) | text-to-video R@1 | 12.8 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CLIP4Clip | text-to-video R@1 | 44.5 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | X-CLIP (Cross-Lingual) | R@1 | 32.3 | — | Unverified
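The benchmark tables report two standard retrieval metrics: R@1 (fraction of queries whose ground-truth video is ranked first) and mAP (mean average precision, used when a query can have several relevant videos, as in partially relevant retrieval). A minimal sketch of both, computed from a toy query-by-video similarity matrix (the numbers are illustrative, not benchmark data):

```python
# Toy sketch of the metrics in the tables above: Recall@1 and mean
# average precision (mAP), from a query x video similarity matrix.
def recall_at_1(sims, gt):
    """Fraction of queries whose top-ranked video is the ground truth."""
    hits = 0
    for row, correct in zip(sims, gt):
        top = max(range(len(row)), key=row.__getitem__)
        hits += (top == correct)
    return hits / len(sims)

def mean_average_precision(sims, relevant):
    """mAP over queries; `relevant` maps each query to a set of video ids."""
    aps = []
    for row, rel in zip(sims, relevant):
        ranking = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        hits, precisions = 0, []
        for rank, vid in enumerate(ranking, start=1):
            if vid in rel:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / len(rel))
    return sum(aps) / len(aps)

sims = [
    [0.9, 0.2, 0.1],  # query 0: video 0 ranked first
    [0.3, 0.1, 0.8],  # query 1: video 2 ranked first
]
print(recall_at_1(sims, [0, 1]))                  # 0.5
print(mean_average_precision(sims, [{0}, {1}]))  # (1 + 1/3) / 2
```

R@5 and R@10, also common on these leaderboards, follow the same pattern with the top-k ranked videos instead of only the first.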