| Multimodal Procedural Planning via Dual Text-Image Prompting | May 2, 2023 | Image Generation, Image to text | Code Available | 1 |
| MAGVLT: Masked Generative Vision-and-Language Transformer | Mar 21, 2023 | Image Captioning, Image Generation | Code Available | 1 |
| ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation | Mar 11, 2023 | Image Captioning, Image to text | Code Available | 1 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Nov 9, 2022 | Image Generation, Image to text | Code Available | 1 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Jun 19, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Jun 15, 2022 | Image Generation, Image to text | Code Available | 1 |
| ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation | Dec 31, 2021 | Image Captioning, Image Generation | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic | Nov 29, 2021 | Contrastive Learning, Descriptive | Code Available | 1 |
| L-Verse: Bidirectional Generation Between Image and Text | Nov 22, 2021 | Image Captioning, Image Generation | Code Available | 1 |
| Unifying Multimodal Transformer for Bi-directional Image and Text Generation | Oct 19, 2021 | Image Generation, Image to text | Code Available | 1 |
| Concadia: Towards Image-Based Text Generation with a Purpose | Apr 16, 2021 | Image Captioning, Image to text | Code Available | 1 |
| Progressive Transformer-Based Generation of Radiology Reports | Feb 19, 2021 | Image to text, Text Generation | Code Available | 1 |
| Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation | Oct 20, 2020 | Image to text, Natural Language Inference | Code Available | 1 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment, Image to text | Unverified | 0 |
| ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | Jun 11, 2025 | Chart Question Answering, Image to text | Unverified | 0 |
| TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP | May 24, 2025 | Image Captioning, Image Generation | Unverified | 0 |
| BRIT: Bidirectional Retrieval over Unified Image-Text Graph | May 24, 2025 | Image to text, Question Answering | Unverified | 0 |
| Robustifying Vision-Language Models via Dynamic Token Reweighting | May 22, 2025 | Image to text | Unverified | 0 |
| UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | May 17, 2025 | Image to text, Information Retrieval | Code Available | 0 |
| Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | May 16, 2025 | Cross-Modal Retrieval, Image to text | Unverified | 0 |
| X-Fusion: Introducing New Modality to Frozen Large Language Models | Apr 29, 2025 | Image to text | Unverified | 0 |
| SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | Apr 17, 2025 | Cross-Modal Retrieval, Image Retrieval | Unverified | 0 |
| DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | Apr 16, 2025 | Contrastive Learning, Image to text | Unverified | 0 |
| TMCIR: Token Merge Benefits Composed Image Retrieval | Apr 15, 2025 | Contrastive Learning, cross-modal alignment | Unverified | 0 |
| Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module | Mar 24, 2025 | Image to text, Medical Report Generation | Unverified | 0 |
| Natural Language Generation | Mar 20, 2025 | Image Captioning, Image to text | Unverified | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Mar 20, 2025 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 0 |
| Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data | Mar 19, 2025 | Image to text | Code Available | 0 |
| MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection | Mar 17, 2025 | Anomaly Detection, Form | Unverified | 0 |
| ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text, Image-to-Text Retrieval | Unverified | 0 |
| On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | Feb 26, 2025 | Cross-Modal Retrieval, Hallucination | Unverified | 0 |
| Natural Language Generation from Visual Sequences: Challenges and Future Directions | Feb 18, 2025 | Image to text, Text Generation | Unverified | 0 |
| Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models | Feb 18, 2025 | Image to text, Optical Character Recognition | Code Available | 0 |
| UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | Feb 16, 2025 | Binary Classification, Fake News Detection | Unverified | 0 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Jan 5, 2025 | Caption Generation, Image to text | Code Available | 0 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text Retrieval, Image to text | Unverified | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification, Image Classification | Unverified | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Jan 1, 2025 | Contrastive Learning, Image Retrieval | Code Available | 0 |
| Survey on Abstractive Text Summarization: Dataset, Models, and Metrics | Dec 22, 2024 | Abstractive Text Summarization, General Knowledge | Code Available | 0 |
| CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP | Dec 5, 2024 | Anomaly Classification, Anomaly Detection | Code Available | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption Generation, Domain Generalization | Unverified | 0 |
| Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | Nov 23, 2024 | Cross-Modal Retrieval, Image to text | Unverified | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption Generation, Cross-Modal Retrieval | Unverified | 0 |
| Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | Nov 8, 2024 | Image Captioning, Image Generation | Unverified | 0 |