Video-Text Retrieval

Video-text retrieval requires a joint understanding of video and language, which distinguishes it from the plain video retrieval task.

Papers

Showing 26-50 of 111 papers

| Title | Status | Hype |
| --- | --- | --- |
| DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Code | 1 |
| MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval | Code | 1 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1 |
| Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval | Code | 1 |
| ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval | Code | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1 |
| Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Code | 1 |
| Multi-event Video-Text Retrieval | Code | 1 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1 |
| Cross-Modal Retrieval with Partially Mismatched Pairs | Code | 1 |
| Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Code | 1 |
| HANet: Hierarchical Alignment Networks for Video-Text Retrieval | Code | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Code | 1 |
| Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Code | 1 |
| Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Code | 1 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Code | 1 |
| RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | Code | 1 |
| Learning the Best Pooling Strategy for Visual Semantic Embedding | Code | 1 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1 |
| VTC: Improving Video-Text Retrieval with User Comments | Code | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1 |
| X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Code | 1 |

No leaderboard results yet.