Video-Text Retrieval

Video-text retrieval requires understanding video and language jointly, which distinguishes it from the plain video retrieval task.

Papers

Showing 51–100 of 111 papers

Title | Status | Hype
Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Code | 4
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - | 0
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval | - | 0
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | - | 0
SViTT: Temporal Learning of Sparse Video-Text Transformers | Code | 1
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | Code | 0
Deep Learning for Video-Text Retrieval: a Review | - | 0
Cross-Modal Retrieval with Partially Mismatched Pairs | Code | 1
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Code | 0
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling | Code | 1
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval | - | 0
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Code | 1
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval | Code | 1
Test of Time: Instilling Video-Language Models with a Sense of Time | Code | 1
HiVLP: Hierarchical Interactive Video-Language Pre-Training | - | 0
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | - | 0
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | - | 0
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | - | 0
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1
VTC: Improving Video-Text Retrieval with User Comments | Code | 1
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | - | 0
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval | - | 0
Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval | - | 0
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - | 0
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Code | 2
Boosting Video-Text Retrieval with Explicit High-Level Semantics | - | 0
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | Code | 1
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | - | 0
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2
Egocentric Video-Language Pretraining | Code | 2
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition | - | 0
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Code | 1
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Code | 1
Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | - | 0
Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | - | 0
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Video-Text Pre-training with Learned Regions | Code | 1
CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - | 0
ViSeRet: A simple yet effective approach to moment retrieval via fine-grained video segmentation | - | 0
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations | - | 0
Learning Context-Adapted Video-Text Retrieval by Attending to User Comments | - | 0
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Code | 1
HANet: Hierarchical Alignment Networks for Video-Text Retrieval | Code | 1
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | Code | 1
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval | - | 0

No leaderboard results yet.