| CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval | Feb 15, 2022 | Image-text Retrieval, Representation Learning | Unverified | 0 |
| Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Feb 14, 2022 | Benchmarking, Contrastive Learning | Code Available | 0 |
| BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Jan 28, 2022 | Image Captioning, Image-text matching | Code Available | 5 |
| Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval | Dec 17, 2021 | Image-text Retrieval, Retrieval | Unverified | 0 |
| Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | Dec 10, 2021 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| UFO: A UniFied TransfOrmer for Vision-Language Representation Learning | Nov 19, 2021 | Image Captioning, Image-text matching | Unverified | 0 |
| Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval | Nov 16, 2021 | Form, Image-text Retrieval | Unverified | 0 |
| SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval | Nov 10, 2021 | Contrastive Learning, Cross-Modal Retrieval | Unverified | 0 |
| FILIP: Fine-grained Interactive Language-Image Pre-Training | Nov 9, 2021 | image-classification, Image Classification | Code Available | 1 |
| Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval | Nov 5, 2021 | Image-text Retrieval, Retrieval | Code Available | 0 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Nov 3, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 |
| Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval | Sep 12, 2021 | Form, Image-text Retrieval | Unverified | 0 |
| Multi-stage Pre-training over Simplified Multimodal Pre-training Models | Jul 22, 2021 | Image-text Retrieval, Retrieval | Code Available | 0 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval, Grounded language learning | Code Available | 1 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 |
| Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | Jun 25, 2021 | Image-text Retrieval, Question Answering | Unverified | 0 |
| CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Jun 19, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 |
| A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching, Image Retrieval | Code Available | 1 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | Unverified | 0 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Continual learning in cross-modal retrieval | Apr 14, 2021 | Continual Learning, cross-modal alignment | Unverified | 0 |
| UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training | Apr 1, 2021 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval | Mar 16, 2021 | Image-text Retrieval, Re-Ranking | Code Available | 1 |
| WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Mar 2, 2021 | BIG-bench Machine Learning, Image Retrieval | Code Available | 2 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Feb 11, 2021 | Cross-Modal Retrieval, Fine-Grained Image Classification | Code Available | 2 |
| GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition | Jan 1, 2021 | Image-text Retrieval, Medical Image Analysis | Code Available | 1 |
| Learning the Best Pooling Strategy for Visual Semantic Embedding | Nov 9, 2020 | Cross-Modal Information Retrieval, Image-text Retrieval | Code Available | 1 |
| A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Sep 3, 2020 | Image-text Retrieval, Medical Visual Question Answering | Code Available | 1 |
| Graph Optimal Transport for Cross-Domain Alignment | Jun 26, 2020 | Graph Matching, Image Captioning | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an Algorithm | Jun 3, 2020 | cross-modal alignment, General Classification | Unverified | 0 |
| Context-Aware Attention Network for Image-Text Retrieval | Jun 1, 2020 | Image-text Retrieval, Retrieval | Unverified | 0 |
| Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Apr 2, 2020 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | Mar 8, 2020 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| XGPT: Cross-modal Generative Pre-Training for Image Captioning | Mar 3, 2020 | Data Augmentation, Denoising | Unverified | 0 |
| MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding | Jan 11, 2020 | Image Captioning, Image-text Retrieval | Code Available | 0 |
| Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval | Oct 11, 2019 | Graph Matching, Image-text Retrieval | Code Available | 1 |
| UNITER: Learning UNiversal Image-TExt Representations | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | Aug 16, 2019 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision | Apr 26, 2019 | Image-text Retrieval, Object | Code Available | 0 |
| Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals | Jan 9, 2019 | Cross-Modal Retrieval, Deep Hashing | Unverified | 0 |
| Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval | Oct 1, 2018 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval | Aug 23, 2018 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | Jun 11, 2018 | Image-text Retrieval, Retrieval | Code Available | 0 |
| Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models | Nov 17, 2017 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Asymmetrically Weighted CCA And Hierarchical Kernel Sentence Embedding For Image & Text Retrieval | Nov 19, 2015 | Image-text Retrieval, Model Selection | Unverified | 0 |