
Video-Text Retrieval

Video-text retrieval requires joint understanding of video and language, which distinguishes it from the purely visual video retrieval task.
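To make the task concrete, here is a minimal sketch of the common dual-encoder retrieval setup, using random vectors as stand-ins for the outputs of real video and text encoders (e.g., a CLIP-style model). Both the embedding dimension and the scoring rule (cosine similarity in a shared space) are illustrative assumptions, not a specific method from the papers listed below.

```python
# Illustrative video-text retrieval: rank candidate videos for a text query
# by cosine similarity in a shared embedding space. The embeddings below are
# random placeholders for real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
D = 64                                      # assumed shared embedding dimension
video_embs = rng.standard_normal((5, D))    # 5 candidate video embeddings
text_emb = rng.standard_normal(D)           # 1 text-query embedding

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_videos(text_emb, video_embs):
    """Return video indices sorted best-first by cosine similarity, plus the scores."""
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims), sims

order, sims = rank_videos(text_emb, video_embs)
print("best-matching video index:", order[0])
```

Text-to-video retrieval runs this ranking once per query; video-to-text retrieval simply swaps the roles of the two embedding sets.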

Papers

Showing 51–100 of 111 papers

| Title | Status | Hype |
| --- | --- | --- |
| Towards Understanding Camera Motions in Any Video | | 0 |
| LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | | 0 |
| Expertized Caption Auto-Enhancement for Video-Text Retrieval | Code | 0 |
| Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment | | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | | 0 |
| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Code | 0 |
| CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives | Code | 0 |
| Beyond Coarse-Grained Matching in Video-Text Retrieval | | 0 |
| NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality | | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | | 0 |
| Multi-Scale Temporal Difference Transformer for Video-Text Retrieval | | 0 |
| Diving Deep into the Motion Representation of Video-Text Models | Code | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | | 0 |
| Uncertainty-aware sign language video retrieval with probability distribution modeling | | 0 |
| An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval | | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | | 0 |
| Learning with Noisy Correspondence | | 0 |
| HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models | | 0 |
| Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval | | 0 |
| Video Editing for Video Retrieval | | 0 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | | 0 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Code | 0 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | | 0 |
| Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Code | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | | 0 |
| Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval | | 0 |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | | 0 |
| CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | Code | 0 |
| Deep Learning for Video-Text Retrieval: a Review | | 0 |
| Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Code | 0 |
| Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval | | 0 |
| Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | | 0 |
| HiVLP: Hierarchical Interactive Video-Language Pre-Training | | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | | 0 |
| TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | | 0 |
| Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval | | 0 |
| Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval | | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0 |
| Boosting Video-Text Retrieval with Explicit High-Level Semantics | | 0 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | | 0 |
| Generalizing Multimodal Pre-training into Multilingual via Language Acquisition | | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | | 0 |
| CLIP2TV: Align, Match and Distill for Video-Text Retrieval | | 0 |
Page 2 of 3

No leaderboard results yet.