SOTAVerified

Video-Text Retrieval

Video-text retrieval requires jointly understanding both video and language, which distinguishes it from the video-only retrieval task.

Papers

Showing 1–10 of 111 papers

| Title | Status | Hype |
| --- | --- | --- |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Code | 4 |
| Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2 |
| M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval | Code | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2 |
| CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Code | 2 |
| Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2 |

No leaderboard results yet.