| SLAN: Self-Locator Aided Network for Vision-Language Understanding | Jan 1, 2023 | Image Retrieval, Image to text | Unverified | 0 |
| Do DALL-E and Flamingo Understand Each Other? | Dec 23, 2022 | Image Captioning, Image Generation | Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | Unverified | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | Nov 28, 2022 | Image Retrieval, Image to text | Unverified | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption Generation, Image Captioning | Unverified | 0 |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code Available | 6 |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Nov 9, 2022 | Image Generation, Image to text | Code Available | 1 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards | Oct 21, 2022 | Image to text, named-entity-recognition | Unverified | 0 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering, Diversity | Code Available | 2 |
| Cross-modal Contrastive Attention Model for Medical Report Generation | Oct 1, 2022 | Image to text, Medical Report Generation | Unverified | 0 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 |
| Every picture tells a story: Image-grounded controllable stylistic story generation | Sep 4, 2022 | Image Captioning, Image to text | Unverified | 0 |
| Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | Aug 18, 2022 | Image Generation, Image to text | Unverified | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | Jul 29, 2022 | Cross-Modal Retrieval, Data Augmentation | Unverified | 0 |
| SRCB at SemEval-2022 Task 5: Pretraining Based Image to Text Late Sequential Fusion System for Multimodal Misogynous Meme Identification | Jul 1, 2022 | Image to text | Unverified | 0 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Jun 19, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Jun 15, 2022 | Image Generation, Image to text | Code Available | 1 |
| Delving into the Openness of CLIP | Jun 4, 2022 | image-classification, Image Classification | Code Available | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | Jun 1, 2022 | Caption Generation, image-classification | Unverified | 0 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code Available | 2 |