| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Nov 3, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 |
| Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval | Sep 12, 2021 | Form, Image-text Retrieval | Unverified | 0 |
| Multi-stage Pre-training over Simplified Multimodal Pre-training Models | Jul 22, 2021 | Image-text Retrieval, Retrieval | Code Available | 0 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval, Grounded language learning | Code Available | 1 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 |
| Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | Jun 25, 2021 | Image-text Retrieval, Question Answering | Unverified | 0 |
| CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Jun 19, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 |
| A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching, Image Retrieval | Code Available | 1 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | Unverified | 0 |