| PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Jul 24, 2023 | Contrastive Learning, Image to text | Code Available | 1 |
| Multimodal Procedural Planning via Dual Text-Image Prompting | May 2, 2023 | Image Generation, Image to text | Code Available | 1 |
| Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation, image-classification | Code Available | 1 |
| Transferable Decoding with Visual Entities for Zero-Shot Image Captioning | Jul 31, 2023 | Caption Generation, Hallucination | Code Available | 1 |
| ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes | Mar 7, 2024 | Image to text, Object | Code Available | 1 |
| Progressive Transformer-Based Generation of Radiology Reports | Feb 19, 2021 | Image to text, Text Generation | Code Available | 1 |
| LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Mar 25, 2025 | Cross-Modal Retrieval, Hallucination | Code Available | 1 |
| DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles | Mar 5, 2025 | Domain Adaptation, Image to text | Code Available | 1 |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | Jul 13, 2023 | Image to text | Code Available | 1 |
| L-Verse: Bidirectional Generation Between Image and Text | Nov 22, 2021 | Image Captioning, Image Generation | Code Available | 1 |
| Brain Captioning: Decoding human brain activity into images and text | May 19, 2023 | Brain Decoding, Depth Estimation | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| Language-Oriented Semantic Latent Representation for Image Transmission | May 16, 2024 | Image to text, Semantic Communication | Code Available | 1 |
| Can MLLMs Perform Text-to-Image In-Context Learning? | Feb 2, 2024 | Image Generation, Image to text | Code Available | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation | Oct 20, 2020 | Image to text, Natural Language Inference | Code Available | 1 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Apr 16, 2024 | Image Captioning, Image Generation | Code Available | 1 |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | Nov 27, 2023 | Cross-Modal Retrieval, Image Generation | Code Available | 1 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | Aug 18, 2022 | Image Generation, Image to text | Unverified | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption Generation, Domain Generalization | Unverified | 0 |
| DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models | Dec 12, 2023 | Denoising, Diversity | Unverified | 0 |
| Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | Aug 16, 2024 | Image to text | Unverified | 0 |
| DiffusionSTR: Diffusion Model for Scene Text Recognition | Jun 29, 2023 | Image to text, model | Unverified | 0 |
| Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese | May 8, 2020 | Image to text, Optical Character Recognition (OCR) | Unverified | 0 |
| Deductron -- A Recurrent Neural Network | Jun 23, 2018 | Image to text, Optical Character Recognition (OCR) | Unverified | 0 |
| DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | Apr 16, 2025 | Contrastive Learning, Image to text | Unverified | 0 |
| An Online Learning Approach to Prompt-based Selection of Generative Models | Oct 17, 2024 | Image to text | Unverified | 0 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text Retrieval, Image to text | Unverified | 0 |
| Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | Jul 25, 2024 | Image to text, Language Modeling | Unverified | 0 |
| Cross-modal Contrastive Attention Model for Medical Report Generation | Oct 1, 2022 | Image to text, Medical Report Generation | Unverified | 0 |
| BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval | Mar 24, 2024 | Diagnostic, Image Retrieval | Unverified | 0 |
| Cross-Modal Alignment with Mixture Experts Neural Network for Intral-City Retail Recommendation | Sep 17, 2020 | cross-modal alignment, Image to text | Unverified | 0 |
| Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval | Dec 4, 2023 | Attribute, Cross-Modal Person Re-Identification | Unverified | 0 |
| BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification | Sep 9, 2023 | Image to text, Language Modeling | Unverified | 0 |
| An End-to-End Neural Network for Image-to-Audio Transformation | Mar 10, 2023 | Image to text, text-to-speech | Unverified | 0 |
| COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | Apr 15, 2022 | Contrastive Learning, Cross-Modal Retrieval | Unverified | 0 |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | Aug 22, 2023 | image-classification, Image Classification | Unverified | 0 |
| Contrastive Learning of Visual-Semantic Embeddings | Oct 17, 2021 | Contrastive Learning, image-classification | Unverified | 0 |
| GPC: Generative and General Pathology Image Classifier | Jul 12, 2024 | Classification, image-classification | Unverified | 0 |
| Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation | Nov 18, 2023 | Image to text, Semantic Similarity | Unverified | 0 |
| ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text, Image-to-Text Retrieval | Unverified | 0 |
| Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | Nov 23, 2024 | Cross-Modal Retrieval, Image to text | Unverified | 0 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment, Image to text | Unverified | 0 |
| Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics | Oct 24, 2024 | Image to text, Image-Variation | Unverified | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | Nov 5, 2024 | Change Detection, Contrastive Learning | Unverified | 0 |
| Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | May 16, 2025 | Cross-Modal Retrieval, Image to text | Unverified | 0 |
| GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks | Nov 2, 2023 | Image Generation, Image to text | Unverified | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| From Image to Text in Sentiment Analysis via Regression and Deep Learning | Sep 1, 2019 | Image to text, regression | Unverified | 0 |