| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | Unverified | 0 |
| LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | Apr 4, 2025 | Self-Supervised Learning, Text Retrieval | Unverified | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | Unverified | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Mar 3, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 |
| Expertized Caption Auto-Enhancement for Video-Text Retrieval | Feb 5, 2025 | Caption Generation, Retrieval | Code Available | 0 |
| Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment | Jan 1, 2025 | Relation, Retrieval | Unverified | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Jan 1, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Dec 26, 2024 | Image-text Retrieval, Information Retrieval | Code Available | 0 |
| CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives | Nov 29, 2024 | reinforcement-learning, Reinforcement Learning | Code Available | 0 |
| Beyond Coarse-Grained Matching in Video-Text Retrieval | Oct 16, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality | Aug 18, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | Jul 10, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| Multi-Scale Temporal Difference Transformer for Video-Text Retrieval | Jun 23, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Diving Deep into the Motion Representation of Video-Text Models | Jun 7, 2024 | Retrieval, Text Retrieval | Code Available | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| Uncertainty-aware sign language video retrieval with probability distribution modeling | May 30, 2024 | Retrieval, Sign Language Retrieval | Unverified | 0 |
| An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval | May 25, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 |
| Learning with Noisy Correspondence | Apr 13, 2024 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Unverified | 0 |
| HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models | Apr 7, 2024 | Hallucination, Representation Learning | Unverified | 0 |
| Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval | Feb 26, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Video Editing for Video Retrieval | Feb 4, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | Dec 10, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | Code Available | 0 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | Oct 23, 2023 | Action Recognition, Descriptive | Unverified | 0 |
| Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | Sep 21, 2023 | Domain Adaptation, Retrieval | Unverified | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval | May 13, 2023 | Retrieval, Text Retrieval | Unverified | 0 |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | May 10, 2023 | Classification, image-classification | Unverified | 0 |
| CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | Mar 22, 2023 | Contrastive Learning, Retrieval | Code Available | 0 |
| Deep Learning for Video-Text Retrieval: a Review | Feb 24, 2023 | Deep Learning, Retrieval | Unverified | 0 |
| Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Feb 19, 2023 | Representation Learning, Retrieval | Code Available | 0 |
| Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval | Jan 30, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | Jan 1, 2023 | Domain Adaptation, Retrieval | Unverified | 0 |
| HiVLP: Hierarchical Interactive Video-Language Pre-Training | Jan 1, 2023 | Retrieval, Self-Supervised Learning | Unverified | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | Jan 1, 2023 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | Sep 28, 2022 | cross-modal alignment, Retrieval | Unverified | 0 |
| Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval | Sep 28, 2022 | Contrastive Learning, Retrieval | Unverified | 0 |
| Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval | Sep 27, 2022 | Cross-Modal Retrieval, Retrieval | Unverified | 0 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| Boosting Video-Text Retrieval with Explicit High-Level Semantics | Aug 8, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | Jul 11, 2022 | Representation Learning, Retrieval | Unverified | 0 |
| Generalizing Multimodal Pre-training into Multilingual via Language Acquisition | May 29, 2022 | Language Acquisition, Retrieval | Unverified | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Mar 11, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Jan 16, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| CLIP2TV: Align, Match and Distill for Video-Text Retrieval | Nov 10, 2021 | Representation Learning, Retrieval | Unverified | 0 |