| MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation | Jan 16, 2022 | Logical Reasoning, Question Answering | Unverified | 0 | 0 |
| Explicit Knowledge-based Reasoning for Visual Question Answering | Nov 9, 2015 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | Unverified | 0 | 0 |
| Explicit Bias Discovery in Visual Question Answering Models | Nov 19, 2018 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA | Nov 19, 2019 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey | Dec 3, 2024 | Cross-Modal Retrieval, Natural Language Understanding | Unverified | 0 | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | Jan 29, 2025 | All, Anatomy | Unverified | 0 | 0 |
| Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems | Jan 1, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | Unverified | 0 | 0 |
| Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | Mar 23, 2025 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model | Aug 22, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0 | 0 |
| Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception | Aug 31, 2023 | Activity Recognition, Human Activity Recognition | Unverified | 0 | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption Generation, Cross-Modal Retrieval | Unverified | 0 | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | Unverified | 0 | 0 |
| Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models | Sep 7, 2023 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | Sep 29, 2021 | Benchmarking, Diagnostic | Unverified | 0 | 0 |
| Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | Feb 24, 2022 | Benchmarking, Diagnostic | Unverified | 0 | 0 |
| VILA^2: VILA Augmented VILA | Jul 24, 2024 | Hallucination, Optical Character Recognition (OCR) | Unverified | 0 | 0 |
| Measuring Machine Intelligence Through Visual Question Answering | Aug 31, 2016 | Image Captioning, Question Answering | Unverified | 0 | 0 |
| Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model | Nov 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | May 29, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Evaluating the Representational Hub of Language and Vision Models | Apr 12, 2019 | Diagnostic, Question Answering | Unverified | 0 | 0 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection, Image Generation | Unverified | 0 | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | Unverified | 0 | 0 |
| Estimating semantic structure for the VQA answer space | Jun 10, 2020 | General Classification, Question Answering | Unverified | 0 | 0 |
| ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation | Nov 9, 2022 | Contrastive Learning, Decoder | Unverified | 0 | 0 |
| An Analysis of Visual Question Answering Algorithms | Mar 28, 2017 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Medical Visual Question Answering: A Survey | Nov 19, 2021 | Medical Visual Question Answering, Question Answering | Unverified | 0 | 0 |
| Medical visual question answering using joint self-supervised learning | Feb 25, 2023 | Decoder, Diversity | Unverified | 0 | 0 |
| ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers | Dec 27, 2024 | Image Captioning, Question Answering | Unverified | 0 | 0 |
| Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering | Oct 18, 2022 | Passage Retrieval, Question Answering | Unverified | 0 | 0 |
| Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion | Aug 14, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | May 30, 2025 | Decision Making, Medical Diagnosis | Unverified | 0 | 0 |
| Analysis on Image Set Visual Question Answering | Mar 31, 2021 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | Jul 8, 2025 | Articles, Multimodal Reasoning | Unverified | 0 | 0 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Apr 18, 2024 | Decision Making, Medical Visual Question Answering | Unverified | 0 | 0 |
| MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | Feb 26, 2025 | Domain Generalization, Medical Image Analysis | Unverified | 0 | 0 |
| MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation | Dec 4, 2023 | Instruction Following, Language Modeling | Unverified | 0 | 0 |
| MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Jun 18, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 | 0 |
| Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation | Mar 6, 2025 | Active Learning, Image Segmentation | Unverified | 0 | 0 |
| Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry | Nov 17, 2024 | Question Answering, Scene Understanding | Unverified | 0 | 0 |
| Memory Augmented Neural Networks for Natural Language Processing | Sep 1, 2017 | AI Agent, Language Modeling | Unverified | 0 | 0 |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Nov 30, 2023 | Visual Question Answering | Unverified | 0 | 0 |
| Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering | Jun 7, 2025 | In-Context Learning, Meta-Learning | Unverified | 0 | 0 |
| MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification | May 29, 2024 | Hallucination, Image Captioning | Unverified | 0 | 0 |
| From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | Jan 31, 2024 | Hallucination, object-detection | Unverified | 0 | 0 |
| MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering | Nov 11, 2022 | Medical Visual Question Answering, Question Answering | Unverified | 0 | 0 |
| Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns | Apr 3, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| MGA-VQA: Multi-Granularity Alignment for Visual Question Answering | Jan 25, 2022 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |