| DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Jun 10, 2025 | Image Captioning, Retrieval | Code Available | 1 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive Learning, Text Retrieval | Code Available | 2 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | Unverified | 0 |
| LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | Apr 4, 2025 | Self-Supervised Learning, Text Retrieval | Unverified | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | Unverified | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Mar 3, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 |
| Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code Available | 3 |
| Expertized Caption Auto-Enhancement for Video-Text Retrieval | Feb 5, 2025 | Caption Generation, Retrieval | Code Available | 0 |
| V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Jan 1, 2025 | Contrastive Learning, Text Retrieval | Unverified | 0 |
| Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment | Jan 1, 2025 | Relation, Retrieval | Unverified | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Dec 26, 2024 | Image-text Retrieval, Information Retrieval | Code Available | 0 |
| CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives | Nov 29, 2024 | reinforcement-learning, Reinforcement Learning | Code Available | 0 |
| Beyond Coarse-Grained Matching in Video-Text Retrieval | Oct 16, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Oct 9, 2024 | Retrieval, Text Retrieval | Code Available | 1 |
| NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality | Aug 18, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive Learning, Question Answering | Code Available | 1 |
| EA-VTR: Event-Aware Video-Text Retrieval | Jul 10, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| Multi-Scale Temporal Difference Transformer for Video-Text Retrieval | Jun 23, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Diving Deep into the Motion Representation of Video-Text Models | Jun 7, 2024 | Retrieval, Text Retrieval | Code Available | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| Uncertainty-aware sign language video retrieval with probability distribution modeling | May 30, 2024 | Retrieval, Sign Language Retrieval | Unverified | 0 |
| An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval | May 25, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 |
| Learning with Noisy Correspondence | Apr 13, 2024 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Unverified | 0 |
| HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models | Apr 7, 2024 | Hallucination, Representation Learning | Unverified | 0 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition, Computational Efficiency | Code Available | 2 |
| Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval | Feb 26, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| Video Editing for Video Retrieval | Feb 4, 2024 | Retrieval, Text Retrieval | Unverified | 0 |
| M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval | Jan 31, 2024 | Retrieval, Text Retrieval | Code Available | 2 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval | Dec 19, 2023 | Few-Shot Learning, Retrieval | Code Available | 1 |
| RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | Dec 11, 2023 | Natural Language Moment Retrieval, Natural Language Queries | Code Available | 1 |
| Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning | Dec 10, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | Code Available | 0 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | Form, Language Modelling | Code Available | 1 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | Oct 23, 2023 | Action Recognition, Descriptive | Unverified | 0 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code Available | 1 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Oct 3, 2023 | Audio Classification, Contrastive Learning | Code Available | 4 |
| Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 1 |
| Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval | Sep 21, 2023 | Domain Adaptation, Retrieval | Unverified | 0 |
| Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Sep 18, 2023 | Retrieval, Text Retrieval | Code Available | 1 |
| UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Aug 28, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Multi-event Video-Text Retrieval | Aug 22, 2023 | Language Modelling, Retrieval | Code Available | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |