| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | Aug 23, 2023 | Image Generation, Image to text | Code Available | 6 |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code Available | 6 |
| FlowTok: Flowing Seamlessly Across Text and Image Tokens | Mar 13, 2025 | Denoising, Image to text | Code Available | 5 |
| Magma: A Foundation Model for Multimodal AI Agents | Feb 18, 2025 | Autonomous Web Navigation, Image to text | Code Available | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Apr 1, 2024 | Image to text, Question Answering | Code Available | 3 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available | 3 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Mar 12, 2023 | All, Image Generation | Code Available | 3 |
| Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | Oct 28, 2024 | Image Retrieval, Image to text | Code Available | 2 |
| In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Aug 9, 2024 | Image to text, Object | Code Available | 2 |
| Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities | Jul 29, 2024 | Contrastive Learning, DeepFake Detection | Code Available | 2 |
| LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | Jul 11, 2024 | Image Retrieval, Image to text | Code Available | 2 |
| Libra: Building Decoupled Vision System on Large Language Models | May 16, 2024 | Image to text, Language Modeling | Code Available | 2 |
| CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Apr 4, 2024 | Attribute, Image Captioning | Code Available | 2 |
| From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models | Apr 1, 2024 | Graph Generation, Image to text | Code Available | 2 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text, Passage Retrieval | Code Available | 2 |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Sep 5, 2023 | Decoder, Image Generation | Code Available | 2 |
| Planting a SEED of Vision in Large Language Model | Jul 16, 2023 | Image Generation, Image to text | Code Available | 2 |
| Generative Diffusion Models on Graphs: Methods and Applications | Feb 6, 2023 | Denoising, Graph Generation | Code Available | 2 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering, Diversity | Code Available | 2 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code Available | 2 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive Learning, Image-text matching | Code Available | 1 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Apr 11, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Mar 25, 2025 | Cross-Modal Retrieval, Hallucination | Code Available | 1 |
| DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles | Mar 5, 2025 | Domain Adaptation, Image to text | Code Available | 1 |
| UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding | Feb 8, 2025 | Denoising, Image Generation | Code Available | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |
| FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | Nov 18, 2024 | Data Augmentation, Image to text | Code Available | 1 |
| See or Guess: Counterfactually Regularized Image Captioning | Aug 29, 2024 | Causal Inference, counterfactual | Code Available | 1 |
| UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation | Aug 21, 2024 | Image Generation, Image Retrieval | Code Available | 1 |
| CMC-Bench: Towards a New Paradigm of Visual Signal Compression | Jun 13, 2024 | Image Compression, Image to text | Code Available | 1 |
| Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design | May 29, 2024 | Dataset Generation, Image to text | Code Available | 1 |
| Language-Oriented Semantic Latent Representation for Image Transmission | May 16, 2024 | Image to text, Semantic Communication | Code Available | 1 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Apr 16, 2024 | Image Captioning, Image Generation | Code Available | 1 |
| ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes | Mar 7, 2024 | Image to text, Object | Code Available | 1 |
| Can MLLMs Perform Text-to-Image In-Context Learning? | Feb 2, 2024 | Image Generation, Image to text | Code Available | 1 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 |
| Improving Image Restoration through Removing Degradations in Textual Representations | Dec 28, 2023 | Deblurring, Denoising | Code Available | 1 |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | Nov 27, 2023 | Cross-Modal Retrieval, Image Generation | Code Available | 1 |
| UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web | Oct 22, 2023 | Image to text, Language Modeling | Code Available | 1 |
| Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition | Oct 8, 2023 | Image to text, Optical Character Recognition (OCR) | Code Available | 1 |
| Multimodal Foundation Models For Echocardiogram Interpretation | Aug 29, 2023 | Cross-Modal Retrieval, Diagnostic | Code Available | 1 |
| Beyond One-to-One: Rethinking the Referring Image Segmentation | Aug 26, 2023 | Decoder, Image Segmentation | Code Available | 1 |
| Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation, image-classification | Code Available | 1 |
| Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval | Aug 8, 2023 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| Transferable Decoding with Visual Entities for Zero-Shot Image Captioning | Jul 31, 2023 | Caption Generation, Hallucination | Code Available | 1 |
| PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Jul 24, 2023 | Contrastive Learning, Image to text | Code Available | 1 |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | Jul 13, 2023 | Image to text | Code Available | 1 |
| Brain Captioning: Decoding human brain activity into images and text | May 19, 2023 | Brain Decoding, Depth Estimation | Code Available | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation, Image to text | Code Available | 1 |