| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language Modeling, Language Modelling | Code Available | 4 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval | May 13, 2023 | Retrieval, Text Retrieval | Unverified | 0 |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | May 10, 2023 | Classification, image-classification | Unverified | 0 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question Answering, Retrieval | Code Available | 1 |
| CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | Mar 22, 2023 | Contrastive Learning, Retrieval | Code Available | 0 |
| Deep Learning for Video-Text Retrieval: a Review | Feb 24, 2023 | Deep Learning, Retrieval | Unverified | 0 |
| Cross-Modal Retrieval with Partially Mismatched Pairs | Feb 22, 2023 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 1 |
| Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Feb 19, 2023 | Representation Learning, Retrieval | Code Available | 0 |
| UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling | Feb 13, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 |
| Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval | Jan 30, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Jan 26, 2023 | Representation Learning, Retrieval | Code Available | 1 |
| MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval | Jan 19, 2023 | Retrieval, Text Retrieval | Code Available | 1 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 |
| HiVLP: Hierarchical Interactive Video-Language Pre-Training | Jan 1, 2023 | Retrieval, Self-Supervised Learning | Unverified | 0 |
| Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | Jan 1, 2023 | Domain Adaptation, Retrieval | Unverified | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | Jan 1, 2023 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| VTC: Improving Video-Text Retrieval with User Comments | Oct 19, 2022 | Representation Learning, Retrieval | Code Available | 1 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | Sep 28, 2022 | cross-modal alignment, Retrieval | Unverified | 0 |
| Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval | Sep 28, 2022 | Contrastive Learning, Retrieval | Unverified | 0 |
| Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval | Sep 27, 2022 | Cross-Modal Retrieval, Retrieval | Unverified | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | Sep 14, 2022 | Retrieval, Text Retrieval | Code Available | 2 |
| Boosting Video-Text Retrieval with Explicit High-Level Semantics | Aug 8, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | Jul 15, 2022 | Contrastive Learning, Retrieval | Code Available | 1 |
| LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | Jul 11, 2022 | Representation Learning, Retrieval | Unverified | 0 |
| Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning, Image Classification | Code Available | 2 |
| Egocentric Video-Language Pretraining | Jun 3, 2022 | Action Recognition, Contrastive Learning | Code Available | 2 |
| Generalizing Multimodal Pre-training into Multilingual via Language Acquisition | May 29, 2022 | Language Acquisition, Retrieval | Unverified | 0 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency, cross-modal alignment | Code Available | 1 |
| MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Apr 26, 2022 | Action Recognition, Retrieval | Code Available | 1 |
| X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | Mar 28, 2022 | Retrieval, Text to Video Retrieval | Code Available | 1 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Mar 11, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding | Jan 16, 2022 | Retrieval, Text Retrieval | Unverified | 0 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition, Linear evaluation | Code Available | 1 |
| Video-Text Pre-training with Learned Regions | Dec 2, 2021 | Representation Learning, Retrieval | Code Available | 1 |
| CLIP2TV: Align, Match and Distill for Video-Text Retrieval | Nov 10, 2021 | Representation Learning, Retrieval | Unverified | 0 |
| ViSeRet: A simple yet effective approach to moment retrieval via fine-grained video segmentation | Oct 11, 2021 | Moment Retrieval, Retrieval | Unverified | 0 |
| CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations | Sep 30, 2021 | Contrastive Learning, Retrieval | Unverified | 0 |
| Learning Context-Adapted Video-Text Retrieval by Attending to User Comments | Sep 29, 2021 | Retrieval, Text Retrieval | Unverified | 0 |
| Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | Sep 9, 2021 | Mixture-of-Experts, Retrieval | Code Available | 1 |
| HANet: Hierarchical Alignment Networks for Video-Text Retrieval | Jul 26, 2021 | Retrieval, Text Matching | Code Available | 1 |
| CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | Jun 21, 2021 | Language Modeling, Language Modelling | Code Available | 1 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Apr 1, 2021 | Retrieval, Text Retrieval | Code Available | 1 |
| Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval | Mar 29, 2021 | Retrieval, Text Retrieval | Unverified | 0 |