SOTAVerified

Video-Text Retrieval

Video-text retrieval requires jointly understanding both video and language, which distinguishes it from the plain video retrieval task.
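A common baseline for this task (used, for example, in CLIP4Clip-style models listed below) embeds a text query and each video's frames into a shared space, mean-pools the frame embeddings, and ranks videos by cosine similarity. The sketch below illustrates only that ranking step, with randomly generated stand-in embeddings; the dimensions and arrays are hypothetical, not taken from any specific paper.

```python
import numpy as np

# Hypothetical pre-computed embeddings: 3 videos x 8 frames x 4 dims,
# plus one text query already projected into the same 4-dim joint space.
rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(3, 8, 4))
text_emb = rng.normal(size=(4,))

# Mean-pool the frame embeddings into one vector per video.
video_emb = frame_emb.mean(axis=1)

# L2-normalize both sides so the dot product equals cosine similarity.
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb)

# Rank videos by similarity to the text query, best match first.
scores = video_emb @ text_emb
ranking = np.argsort(-scores)
print(ranking)
```

Real systems replace the random arrays with encoder outputs (e.g. a CLIP text tower and a frame/video tower) and often learn a temporal aggregation instead of plain mean pooling.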

Papers

Showing 1–25 of 111 papers

Title | Status | Hype
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Code | 4
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Code | 2
Egocentric Video-Language Pretraining | Code | 2
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval | Code | 2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Code | 1
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Code | 1
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning | Code | 1
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
Learning the Best Pooling Strategy for Visual Semantic Embedding | Code | 1
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Code | 1
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Code | 1
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | Code | 1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1
Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1

No leaderboard results yet.