| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive Learning, Fine-Grained Image Recognition | Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |
| Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs | Jun 28, 2021 | Question Answering, Task 2 | Unverified | 0 |
| Image Captioning with Compositional Neural Module Networks | Jul 10, 2020 | Image Captioning, Question Answering | Unverified | 0 |
| Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering | Oct 17, 2020 | Question Answering, Visual Question Answering | Unverified | 0 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | May 23, 2023 | Image Manipulation, Question Answering | Unverified | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | Unverified | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | Unverified | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| A Novel Framework for Robustness Analysis of Visual QA Models | Nov 16, 2017 | Question Answering, Visual Question Answering | Unverified | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | Mar 9, 2016 | General Knowledge, Image Captioning | Unverified | 0 |
| Image Position Prediction in Multimodal Documents | May 1, 2020 | Articles, Caption Generation | Unverified | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | Oct 24, 2020 | General Classification, Multiple-choice | Unverified | 0 |
| Differentiable End-to-End Program Executor for Sample and Computationally Efficient VQA | Jan 1, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Novel Attention-based Aggregation Function to Combine Vision and Language | Apr 27, 2020 | General Classification, Image Captioning | Unverified | 0 |
| Beyond the Hype: A dispassionate look at vision-language models in medical scenario | Aug 16, 2024 | Question Answering, Spatial Reasoning | Unverified | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | Dec 15, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Detection-based Intermediate Supervision for Visual Question Answering | Dec 26, 2023 | cross-modal alignment, Logical Reasoning | Unverified | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question Answering, Video Generation | Unverified | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | Unverified | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Jun 14, 2024 | Hallucination, Medical Visual Question Answering | Unverified | 0 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | Sep 23, 2024 | Multiple-choice, Question Answering | Unverified | 0 |