| Hierarchical Gumbel Attention Network for Text-based Person Search | Oct 10, 2020 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels | Jul 8, 2024 | Contrastive Learning, Image Retrieval | —Unverified | 0 | 0 |
| I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation | Mar 20, 2017 | Caption Generation, Data Augmentation | —Unverified | 0 | 0 |
| Illegible Text to Readable Text: An Image-to-Image Transformation using Conditional Sliced Wasserstein Adversarial Networks | Oct 11, 2019 | Generative Adversarial Network, Image-to-Image Translation | —Unverified | 0 | 0 |
| Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | Nov 8, 2024 | Image Captioning, Image Generation | —Unverified | 0 | 0 |
| Image Captioners Sometimes Tell More Than Images They See | May 4, 2023 | Descriptive, Image Captioning | —Unverified | 0 | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | —Unverified | 0 | 0 |
| Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module | Mar 24, 2025 | Image to text, Medical Report Generation | —Unverified | 0 | 0 |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Jul 1, 2024 | Image to text, Language Modeling | —Unverified | 0 | 0 |
| Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | Nov 23, 2024 | Cross-Modal Retrieval, Image to text | —Unverified | 0 | 0 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment, Image to text | —Unverified | 0 | 0 |
| Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | Mar 13, 2023 | Decoder, Image to text | —Unverified | 0 | 0 |
| Instruction Tuning-free Visual Token Complement for Multimodal LLMs | Aug 9, 2024 | Image Generation, Image to text | —Unverified | 0 | 0 |
| Interpreting Vision and Language Generative Models with Semantic Visual Priors | Apr 28, 2023 | Image to text | —Unverified | 0 | 0 |
| Is Cross-modal Information Retrieval Possible without Training? | Apr 20, 2023 | Contrastive Learning, Cross-Modal Information Retrieval | —Unverified | 0 | 0 |
| I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models | Jun 13, 2023 | Adversarial Attack, Decoder | —Unverified | 0 | 0 |
| Knowledge Aware Semantic Concept Expansion for Image-Text Matching | Aug 10, 2019 | Common Sense Reasoning, Content-Based Image Retrieval | —Unverified | 0 | 0 |
| Knowledge driven Description Synthesis for Floor Plan Interpretation | Mar 15, 2021 | Caption Generation, Descriptive | —Unverified | 0 | 0 |
| Semantically Grounded QFormer for Efficient Vision Language Understanding | Nov 13, 2023 | Diversity, Image to text | —Unverified | 0 | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| Learning Deep Structure-Preserving Image-Text Embeddings | Nov 19, 2015 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | Dec 4, 2023 | Image to text, object-detection | —Unverified | 0 | 0 |
| Leveraging AI to Generate Audio for User-generated Content in Video Games | Apr 25, 2024 | Audio Generation, Game Design | —Unverified | 0 | 0 |
| Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency | Oct 5, 2023 | Image Generation, Image to text | —Unverified | 0 | 0 |
| MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant | Mar 7, 2024 | Clinical Knowledge, Image to text | —Unverified | 0 | 0 |
| MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection | Mar 17, 2025 | Anomaly Detection, Form | —Unverified | 0 | 0 |
| Category-Oriented Representation Learning for Image to Multi-Modal Retrieval | May 6, 2023 | Cross-Modal Retrieval, Image Retrieval | —Unverified | 0 | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | Jun 1, 2022 | Caption Generation, image-classification | —Unverified | 0 | 0 |
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | Nov 10, 2019 | Caption Generation, Image Generation | —Unverified | 0 | 0 |
| Multimodal Neurons in Pretrained Text-Only Transformers | Aug 3, 2023 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Natural Language Generation | Mar 20, 2025 | Image Captioning, Image to text | —Unverified | 0 | 0 |
| Natural Language Generation from Visual Sequences: Challenges and Future Directions | Feb 18, 2025 | Image to text, Text Generation | —Unverified | 0 | 0 |
| Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels | Sep 18, 2023 | Generative Adversarial Network, Handwriting Recognition | —Unverified | 0 | 0 |
| On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | Feb 26, 2025 | Cross-Modal Retrieval, Hallucination | —Unverified | 0 | 0 |
| OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | Apr 1, 2024 | Image Segmentation, Image to text | —Unverified | 0 | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | Jul 29, 2022 | Cross-Modal Retrieval, Data Augmentation | —Unverified | 0 | 0 |
| Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models | Oct 7, 2024 | Image to text | —Unverified | 0 | 0 |
| PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting | Jul 14, 2023 | Cross-Modal Retrieval, Image to text | —Unverified | 0 | 0 |
| RefineNet: Enhancing Text-to-Image Conversion with High-Resolution and Detail Accuracy through Hierarchical Transformers and Progressive Refinement | Dec 27, 2023 | Computational Efficiency, Image Generation | —Unverified | 0 | 0 |
| Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API | Oct 7, 2023 | Decoder, document understanding | —Unverified | 0 | 0 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text, Instruction Following | —Unverified | 0 | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification, Image Classification | —Unverified | 0 | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption Generation, Image Captioning | —Unverified | 0 | 0 |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | Feb 9, 2023 | Few-Shot Learning, Image Captioning | —Unverified | 0 | 0 |
| Revisiting DETR Pre-training for Object Detection | Aug 2, 2023 | Image to text, Object | —Unverified | 0 | 0 |
| Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 | 0 |