| Learning Deep Structure-Preserving Image-Text Embeddings | Nov 19, 2015 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | Dec 4, 2023 | Image to text, object-detection | —Unverified | 0 | 0 |
| Leveraging AI to Generate Audio for User-generated Content in Video Games | Apr 25, 2024 | Audio Generation, Game Design | —Unverified | 0 | 0 |
| Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency | Oct 5, 2023 | Image Generation, Image to text | —Unverified | 0 | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | —Unverified | 0 | 0 |
| MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant | Mar 7, 2024 | Clinical Knowledge, Image to text | —Unverified | 0 | 0 |
| MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection | Mar 17, 2025 | Anomaly Detection, Form | —Unverified | 0 | 0 |
| Category-Oriented Representation Learning for Image to Multi-Modal Retrieval | May 6, 2023 | Cross-Modal Retrieval, Image Retrieval | —Unverified | 0 | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | Jun 1, 2022 | Caption Generation, image-classification | —Unverified | 0 | 0 |
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | Nov 10, 2019 | Caption Generation, Image Generation | —Unverified | 0 | 0 |
| Multimodal Neurons in Pretrained Text-Only Transformers | Aug 3, 2023 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Natural Language Generation | Mar 20, 2025 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Natural Language Generation from Visual Sequences: Challenges and Future Directions | Feb 18, 2025 | Image to text, Text Generation | —Unverified | 0 | 0 |
| Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels | Sep 18, 2023 | Generative Adversarial Network, Handwriting Recognition | —Unverified | 0 | 0 |
| On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | Feb 26, 2025 | Cross-Modal Retrieval, Hallucination | —Unverified | 0 | 0 |
| OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | Apr 1, 2024 | Image Segmentation, Image to text | —Unverified | 0 | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | Jul 29, 2022 | Cross-Modal Retrieval, Data Augmentation | —Unverified | 0 | 0 |
| Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models | Oct 7, 2024 | Image to text | —Unverified | 0 | 0 |
| PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting | Jul 14, 2023 | Cross-Modal Retrieval, Image to text | —Unverified | 0 | 0 |
| RefineNet: Enhancing Text-to-Image Conversion with High-Resolution and Detail Accuracy through Hierarchical Transformers and Progressive Refinement | Dec 27, 2023 | Computational Efficiency, Image Generation | —Unverified | 0 | 0 |
| Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API | Oct 7, 2023 | Decoder, document understanding | —Unverified | 0 | 0 |