| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language Modeling, Language Modelling | Code Available | 4 | 5 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Oct 3, 2023 | Audio Classification, Contrastive Learning | Code Available | 4 | 5 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 | 5 |
| Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code Available | 3 | 5 |
| Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning, Image Classification | Code Available | 2 | 5 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition, Computational Efficiency | Code Available | 2 | 5 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 | 5 |
| Egocentric Video-Language Pretraining | Jun 3, 2022 | Action Recognition, Contrastive Learning | Code Available | 2 | 5 |
| M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval | Jan 31, 2024 | Retrieval, Text Retrieval | Code Available | 2 | 5 |
| CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Sep 14, 2022 | Retrieval, Text Retrieval | Code Available | 2 | 5 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive Learning, Text Retrieval | Code Available | 2 | 5 |
| COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Nov 1, 2020 | Cross-Modal Retrieval, Representation Learning | Code Available | 1 | 5 |
| UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling | Feb 13, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Oct 8, 2023 | Action Recognition, Continual Learning | Code Available | 1 | 5 |
| Unified Coarse-to-Fine Alignment for Video-Text Retrieval | Sep 18, 2023 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 | 5 |
| VTC: Improving Video-Text Retrieval with User Comments | Oct 19, 2022 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | Jul 15, 2022 | Contrastive Learning, Retrieval | Code Available | 1 | 5 |
| CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | Jun 21, 2021 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Video-Text Pre-training with Learned Regions | Dec 2, 2021 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 1 | 5 |
| RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | Dec 11, 2023 | Natural Language Moment Retrieval, Natural Language Queries | Code Available | 1 | 5 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition, Linear evaluation | Code Available | 1 | 5 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 | 5 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | Form, Language Modelling | Code Available | 1 | 5 |
| Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning | Mar 1, 2020 | Cross-Modal Retrieval, Retrieval | Code Available | 1 | 5 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Apr 1, 2021 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Jun 10, 2025 | Image Captioning, Retrieval | Code Available | 1 | 5 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question Answering, Retrieval | Code Available | 1 | 5 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Apr 26, 2022 | Action Recognition, Retrieval | Code Available | 1 | 5 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency, cross-modal alignment | Code Available | 1 | 5 |
| MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval | Jan 19, 2023 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 | 5 |
| Multi-event Video-Text Retrieval | Aug 22, 2023 | Language Modelling, Retrieval | Code Available | 1 | 5 |
| Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Sep 9, 2021 | Mixture-of-Experts, Retrieval | Code Available | 1 | 5 |
| Text Proxy: Decomposing Retrieval from a 1-to-N Relationship into N 1-to-1 Relationships for Text-Video Retrieval | Oct 9, 2024 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| Cross-Modal Retrieval with Partially Mismatched Pairs | Feb 22, 2023 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 1 | 5 |
| ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval | Dec 19, 2023 | Few-Shot Learning, Retrieval | Code Available | 1 | 5 |
| Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval | Jun 11, 2019 | Cross-Modal Retrieval, Multiple Instance Learning | Code Available | 1 | 5 |
| HANet: Hierarchical Alignment Networks for Video-Text Retrieval | Jul 26, 2021 | Retrieval, Text Matching | Code Available | 1 | 5 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive Learning, Question Answering | Code Available | 1 | 5 |
| Learning the Best Pooling Strategy for Visual Semantic Embedding | Nov 9, 2020 | Cross-Modal Information Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Jan 26, 2023 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 | 5 |
| UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Aug 28, 2023 | Question Answering, Retrieval | Code Available | 1 | 5 |
| X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Mar 28, 2022 | Retrieval, Text to Video Retrieval | Code Available | 1 | 5 |