| A Multimodal Memes Classification: A Survey and Open Research Issues | Sep 17, 2020 | Classification, General Classification | Unverified | 0 | 0 |
| Diversity and Consistency: Exploring Visual Question-Answer Pair Generation | Nov 1, 2021 | Diversity, Question Answering | Unverified | 0 | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 | 0 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Multimodal Reranking for Knowledge-Intensive Visual Question Answering | Jul 17, 2024 | Answer Generation, Question Answering | Unverified | 0 | 0 |
| American == White in Multimodal Language-and-Image AI | Jul 1, 2022 | Image Captioning, Question Answering | Unverified | 0 | 0 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Jun 12, 2024 | Document Image Classification | Unverified | 0 | 0 |
| Multimodal Transformer With a Low-Computational-Cost Guarantee | Feb 23, 2024 | Action Recognition, Question Answering | Unverified | 0 | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 | 0 |
| Multimodal Unified Attention Networks for Vision-and-Language Interactions | Aug 12, 2019 | Question Answering, Visual Grounding | Unverified | 0 | 0 |
| All You May Need for VQA are Image Captions | Jan 16, 2022 | All, Image Captioning | Unverified | 0 | 0 |
| All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | Feb 24, 2025 | All, Multimodal Reasoning | Unverified | 0 | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | Unverified | 0 | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | Image Classification | Unverified | 0 | 0 |
| Vision-Language Pretraining: Current Trends and the Future | May 1, 2022 | Question Answering, Representation Learning | Unverified | 0 | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 | 0 |
| Multi-task Learning of Hierarchical Vision-Language Representation | Dec 3, 2018 | Multi-Task Learning, Question Answering | Unverified | 0 | 0 |
| AlignVE: Visual Entailment Recognition Based on Alignment Relations | Nov 16, 2022 | Question Answering, Relation | Unverified | 0 | 0 |
| Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | May 30, 2025 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| MUST-VQA: MUltilingual Scene-text VQA | Sep 14, 2022 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering | Jan 1, 2025 | Contrastive Learning, Medical Visual Question Answering | Unverified | 0 | 0 |
| Differentiable End-to-End Program Executor for Sample and Computationally Efficient VQA | Jan 1, 2021 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering | Jul 7, 2021 | Medical Visual Question Answering, Missing Labels | Unverified | 0 | 0 |
| MyVLM: Personalizing VLMs for User-Specific Queries | Mar 21, 2024 | Image Captioning, Language Modelling | Unverified | 0 | 0 |
| Vision-to-Language Tasks Based on Attributes and Attention Mechanism | May 29, 2019 | Image Captioning, Question Answering | Unverified | 0 | 0 |