| Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | Jul 25, 2024 | Image to text, Language Modeling | —Unverified | 0 | 0 |
| DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | Apr 16, 2025 | Contrastive Learning, Image to text | —Unverified | 0 | 0 |
| Deductron -- A Recurrent Neural Network | Jun 23, 2018 | Image to text, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese | May 8, 2020 | Image to text, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| DiffusionSTR: Diffusion Model for Scene Text Recognition | Jun 29, 2023 | Image to text, model | —Unverified | 0 | 0 |
| DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models | Dec 12, 2023 | Denoising, Diversity | —Unverified | 0 | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption Generation, Domain Generalization | —Unverified | 0 | 0 |
| Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | Aug 18, 2022 | Image Generation, Image to text | —Unverified | 0 | 0 |
| Doc2Im: document to image conversion through self-attentive embedding | Nov 8, 2018 | Document To Image Conversion, document understanding | —Unverified | 0 | 0 |
| DOCCI: Descriptions of Connected and Contrasting Images | Apr 30, 2024 | Image Generation, Image to text | —Unverified | 0 | 0 |
| Do DALL-E and Flamingo Understand Each Other? | Dec 23, 2022 | Image Captioning, Image Generation | —Unverified | 0 | 0 |
| Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | Apr 15, 2024 | Anomaly Detection, Anomaly Localization | —Unverified | 0 | 0 |
| Dynamic Traceback Learning for Medical Report Generation | Jan 24, 2024 | Image to text, Medical Report Generation | —Unverified | 0 | 0 |
| Efficient End-to-End Visual Document Understanding with Rationale Distillation | Nov 16, 2023 | document understanding, Image to text | —Unverified | 0 | 0 |
| EI-CLIP: Entity-Aware Interventional Contrastive Learning for E-Commerce Cross-Modal Retrieval | Jan 1, 2022 | Causal Inference, Contrastive Learning | —Unverified | 0 | 0 |
| EmojiGAN: learning emojis distributions with a generative model | Oct 1, 2018 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Enhancing Vision-Language Pre-training with Rich Supervisions | Mar 5, 2024 | Image to text, Table Detection | —Unverified | 0 | 0 |
| Evaluating authenticity and quality of image captions via sentiment and semantic analyses | Sep 14, 2024 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Every picture tells a story: Image-grounded controllable stylistic story generation | Sep 4, 2022 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption Generation, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Mar 14, 2024 | Image to text, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Faithful Chart Summarization with ChaTS-Pi | May 29, 2024 | Image to text, Sentence | —Unverified | 0 | 0 |
| Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval | Jun 11, 2024 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings | Jul 25, 2017 | Clustering, General Classification | —Unverified | 0 | 0 |
| From Image to Text in Sentiment Analysis via Regression and Deep Learning | Sep 1, 2019 | Image to text, regression | —Unverified | 0 | 0 |