| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | Aug 23, 2023 | Image Generation, Image to text | Code Available | 6 |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code Available | 6 |
| FlowTok: Flowing Seamlessly Across Text and Image Tokens | Mar 13, 2025 | Denoising, Image to text | Code Available | 5 |
| Magma: A Foundation Model for Multimodal AI Agents | Feb 18, 2025 | Autonomous Web Navigation, Image to text | Code Available | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Apr 1, 2024 | Image to text, Question Answering | Code Available | 3 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available | 3 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Mar 12, 2023 | All, Image Generation | Code Available | 3 |
| Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | Oct 28, 2024 | Image Retrieval, Image to text | Code Available | 2 |
| In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Aug 9, 2024 | Image to text, Object | Code Available | 2 |