SOTAVerified

Video-Text Retrieval

Video-text retrieval requires understanding video and language jointly, which makes it different from the video retrieval task.

Papers

Showing 1-50 of 111 papers

Title | Status | Hype
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Code | 1
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Towards Understanding Camera Motions in Any Video | - | 0
LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | - | 0
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | - | 0
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | - | 0
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3
Expertized Caption Auto-Enhancement for Video-Text Retrieval | Code | 0
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | - | 0
Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment | - | 0
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | - | 0
Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Code | 0
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives | Code | 0
Beyond Coarse-Grained Matching in Video-Text Retrieval | - | 0
Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Code | 1
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality | - | 0
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
Video-Language Alignment via Spatio-Temporal Graph Transformer | Code | 1
EA-VTR: Event-Aware Video-Text Retrieval | - | 0
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval | - | 0
Diving Deep into the Motion Representation of Video-Text Models | Code | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | - | 0
Uncertainty-aware sign language video retrieval with probability distribution modeling | - | 0
An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval | - | 0
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | - | 0
Learning with Noisy Correspondence | - | 0
HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models | - | 0
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval | - | 0
Video Editing for Video Retrieval | - | 0
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval | Code | 2
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1
ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval | Code | 1
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | Code | 1
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | - | 0
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2
Harvest Video Foundation Models via Efficient Post-Pretraining | Code | 0
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Code | 1
Videoprompter: an ensemble of foundational models for zero-shot video understanding | - | 0
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Code | 1
Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | - | 0
Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | - | 0
Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Code | 1
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Code | 1
Multi-event Video-Text Retrieval | Code | 1
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1
TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Code | 0

No leaderboard results yet.