| GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks | Jan 1, 2023 | Image Generation, Image-text Retrieval | Unverified | 0 |
| Efficient Image Captioning for Edge Devices | Dec 18, 2022 | CPU, Image Captioning | Unverified | 0 |
| HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | Dec 16, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| FlexiViT: One Model for All Patch Sizes | Dec 15, 2022 | All, Image-text Retrieval | Code Available | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| NLIP: Noise-robust Language-Image Pre-training | Dec 14, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing | Dec 12, 2022 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| ComCLIP: Training-Free Compositional Image and Text Matching | Nov 25, 2022 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Generative Negative Text Replay for Continual Vision-Language Pretraining | Oct 31, 2022 | Continual Learning, image-classification | Unverified | 0 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Oct 23, 2022 | Image Captioning, Image-text Retrieval | Code Available | 0 |
| Dissecting Deep Metric Learning Losses for Image-Text Retrieval | Oct 21, 2022 | Cross-Modal Retrieval, Image-text matching | Code Available | 0 |
| Image-Text Retrieval with Binary and Continuous Label Supervision | Oct 20, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | Unverified | 0 |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Oct 18, 2022 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 |
| Learning to embed semantic similarity for joint image-text retrieval | Oct 7, 2022 | Image-text Retrieval, Metric Learning | Unverified | 0 |
| Efficient Multilingual Multi-modal Pre-training through Triple Contrastive Loss | Oct 1, 2022 | image-classification, Image Classification | Unverified | 0 |
| Re-Imagen: Retrieval-Augmented Text-to-Image Generator | Sep 29, 2022 | Image Generation, Image-text Retrieval | Unverified | 0 |
| Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text | Sep 28, 2022 | Image Captioning, Image Retrieval | Code Available | 1 |
| VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models | Sep 12, 2022 | Attribute, Image-text Retrieval | Code Available | 0 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Revising Image-Text Retrieval via Multi-Modal Entailment | Aug 22, 2022 | Image-text Retrieval, Natural Language Inference | Unverified | 0 |
| CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval | Aug 21, 2022 | Clustering, Contrastive Learning | Unverified | 0 |
| VLMAE: Vision-Language Masked Autoencoder | Aug 19, 2022 | Image-text Retrieval, Language Modeling | Unverified | 0 |
| Intra-Modal Constraint Loss For Image-Text Retrieval | Jul 11, 2022 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 0 |
| Dynamic Contrastive Distillation for Image-Text Retrieval | Jul 4, 2022 | Contrastive Learning, GPU | Unverified | 0 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| VL-BEiT: Generative Vision-Language Pretraining | Jun 2, 2022 | image-classification, Image Classification | Unverified | 0 |
| Cross-lingual and Multilingual CLIP | Jun 1, 2022 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training | Jun 1, 2022 | Contrastive Learning, Cross-Lingual Transfer | Code Available | 1 |
| Prompt-based Learning for Unpaired Image Captioning | May 26, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset | May 25, 2022 | Image Captioning, Image Retrieval | Unverified | 0 |
| HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval | May 24, 2022 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency, cross-modal alignment | Code Available | 1 |
| CCMB: A Large-scale Chinese Cross-modal Benchmark | May 8, 2022 | image-classification, Image Classification | Code Available | 1 |
| Progressive Learning for Image Retrieval with Hybrid-Modality Queries | Apr 24, 2022 | Image Retrieval, Image-text Retrieval | Unverified | 0 |
| COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | Apr 15, 2022 | Contrastive Learning, Cross-Modal Retrieval | Unverified | 0 |
| Robust Cross-Modal Representation Learning with Progressive Self-Distillation | Apr 10, 2022 | Contrastive Learning, Image Captioning | Unverified | 0 |
| Image-text Retrieval: A Survey on Recent Research and Development | Mar 28, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| Single-Stream Multi-Level Alignment for Vision-Language Pretraining | Mar 27, 2022 | Image-text Retrieval, Question Answering | Code Available | 0 |
| LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval | Mar 10, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval | Mar 8, 2022 | Image-text Retrieval, Information Retrieval | Code Available | 1 |
| An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | Feb 26, 2022 | Image-text Retrieval, Meta-Learning | Code Available | 0 |
| Vision-Language Pre-Training with Triple Contrastive Learning | Feb 21, 2022 | Contrastive Learning, cross-modal alignment | Code Available | 2 |