| GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks | Jan 1, 2023 | Image Generation, Image-text Retrieval | Unverified | 0 |
| Efficient Image Captioning for Edge Devices | Dec 18, 2022 | CPU, Image Captioning | Unverified | 0 |
| HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | Dec 16, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| FlexiViT: One Model for All Patch Sizes | Dec 15, 2022 | All, Image-text Retrieval | Code Available | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| NLIP: Noise-robust Language-Image Pre-training | Dec 14, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing | Dec 12, 2022 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified | 0 |
| ComCLIP: Training-Free Compositional Image and Text Matching | Nov 25, 2022 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Generative Negative Text Replay for Continual Vision-Language Pretraining | Oct 31, 2022 | Continual Learning, image-classification | Unverified | 0 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Oct 23, 2022 | Image Captioning, Image-text Retrieval | Code Available | 0 |
| Dissecting Deep Metric Learning Losses for Image-Text Retrieval | Oct 21, 2022 | Cross-Modal Retrieval, Image-text matching | Code Available | 0 |
| Image-Text Retrieval with Binary and Continuous Label Supervision | Oct 20, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | Unverified | 0 |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Oct 18, 2022 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 |
| Learning to embed semantic similarity for joint image-text retrieval | Oct 7, 2022 | Image-text Retrieval, Metric Learning | Unverified | 0 |
| Efficient Multilingual Multi-modal Pre-training through Triple Contrastive Loss | Oct 1, 2022 | image-classification, Image Classification | Unverified | 0 |
| Re-Imagen: Retrieval-Augmented Text-to-Image Generator | Sep 29, 2022 | Image Generation, Image-text Retrieval | Unverified | 0 |
| Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text | Sep 28, 2022 | Image Captioning, Image Retrieval | Code Available | 1 |
| VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models | Sep 12, 2022 | Attribute, Image-text Retrieval | Code Available | 0 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 |