| MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation | Jan 16, 2022 | Logical Reasoning, Question Answering | —Unverified | 0 | 0 |
| Explicit Knowledge-based Reasoning for Visual Question Answering | Nov 9, 2015 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | —Unverified | 0 | 0 |
| Explicit Bias Discovery in Visual Question Answering Models | Nov 19, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA | Nov 19, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey | Dec 3, 2024 | Cross-Modal Retrieval, Natural Language Understanding | —Unverified | 0 | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | Jan 29, 2025 | All, Anatomy | —Unverified | 0 | 0 |
| Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems | Jan 1, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | —Unverified | 0 | 0 |
| Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | Mar 23, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model | Aug 22, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | —Unverified | 0 | 0 |
| Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception | Aug 31, 2023 | Activity Recognition, Human Activity Recognition | —Unverified | 0 | 0 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption Generation, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | —Unverified | 0 | 0 |
| Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models | Sep 7, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | Sep 29, 2021 | Benchmarking, Diagnostic | —Unverified | 0 | 0 |
| Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | Feb 24, 2022 | Benchmarking, Diagnostic | —Unverified | 0 | 0 |
| VILA^2: VILA Augmented VILA | Jul 24, 2024 | Hallucination, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Measuring Machine Intelligence Through Visual Question Answering | Aug 31, 2016 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model | Nov 19, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | May 29, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Evaluating the Representational Hub of Language and Vision Models | Apr 12, 2019 | Diagnostic, Question Answering | —Unverified | 0 | 0 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection, Image Generation | —Unverified | 0 | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | —Unverified | 0 | 0 |