| Gemini Pro Defeated by GPT-4V: Evidence from Education | Dec 27, 2023 | image-classification, Image Classification | —Unverified | 0 | 0 |
| A Picture May Be Worth a Hundred Words for Visual Question Answering | Jun 25, 2021 | Data Augmentation, Descriptive | —Unverified | 0 | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description, Question Answering | —Unverified | 0 | 0 |
| UNITER: Learning UNiversal Image-TExt Representations | Sep 25, 2019 | Image-text matching, Image-text Retrieval | —Unverified | 0 | 0 |
| ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Oct 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Nov 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? | Jul 25, 2022 | All, Question Answering | —Unverified | 0 | 0 |
| AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Jun 14, 2025 | Decision Making, Question Answering | —Unverified | 0 | 0 |
| Iterated learning for emergent systematicity in VQA | May 3, 2021 | Question Answering, Systematic Generalization | —Unverified | 0 | 0 |
| It Takes Two to Tango: Towards Theory of AI's Mind | Apr 3, 2017 | Attribute, Question Answering | —Unverified | 0 | 0 |
| iVQA: Inverse Visual Question Answering | Oct 10, 2017 | Question Answering, Question Generation | —Unverified | 0 | 0 |
| Jaeger: A Concatenation-Based Multi-Transformer VQA Model | Oct 11, 2023 | Dimensionality Reduction, model | —Unverified | 0 | 0 |
| GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | Jun 22, 2025 | Answer Generation, Decision Making | —Unverified | 0 | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | —Unverified | 0 | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | —Unverified | 0 | 0 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning | Oct 5, 2024 | Visual Question Answering | —Unverified | 0 | 0 |
| JEEM: Vision-Language Understanding in Four Arabic Dialects | Mar 27, 2025 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Un jeu de données pour répondre à des questions visuelles à propos d’entités nommées en utilisant des bases de connaissances (ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities) | Jun 1, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Joint learning of object graph and relation graph for visual question answering | May 9, 2022 | Attribute, Graph Neural Network | —Unverified | 0 | 0 |
| Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention | Apr 14, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | —Unverified | 0 | 0 |
| FVQA: Fact-based Visual Question Answering | Jun 17, 2016 | Common Sense Reasoning, Question Answering | —Unverified | 0 | 0 |
| JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems | Jan 1, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering | Mar 19, 2023 | Common Sense Reasoning, Information Retrieval | —Unverified | 0 | 0 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | Dec 4, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| 'Just because you are right, doesn't mean I am wrong': Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasks | Apr 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration | Jan 7, 2025 | Anomaly Detection, Anomaly Segmentation | —Unverified | 0 | 0 |
| Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering | Apr 24, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Kernel Pooling for Convolutional Neural Networks | Jul 1, 2017 | Face Recognition, Fine-Grained Visual Categorization | —Unverified | 0 | 0 |
| Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge | May 22, 2025 | Anomaly Detection, Question Answering | —Unverified | 0 | 0 |
| Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models | Mar 26, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Knowing Where to Look? Analysis on Attention of Visual Question Answering System | Oct 9, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Unshuffling Data for Improved Generalization | Feb 27, 2020 | Clustering, Data Augmentation | —Unverified | 0 | 0 |
| Knowledge Acquisition for Visual Question Answering via Iterative Querying | Jul 1, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings | May 3, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Knowledge-Based Counterfactual Queries for Visual Question Answering | Mar 5, 2023 | counterfactual, Decision Making | —Unverified | 0 | 0 |
| Knowledge-Based Visual Question Answering in Videos | Apr 17, 2020 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Knowledge Condensation and Reasoning for Knowledge-based VQA | Mar 15, 2024 | Question Answering, Reading Comprehension | —Unverified | 0 | 0 |
| Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Unshuffling Data for Improved Generalization in Visual Question Answering | Jan 1, 2021 | Out-of-Distribution Generalization, Question Answering | —Unverified | 0 | 0 |
| Fusion of Detected Objects in Text for Visual Question Answering | Aug 14, 2019 | Question Answering, Visual Commonsense Reasoning | —Unverified | 0 | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | —Unverified | 0 | 0 |
| Answer-Type Prediction for Visual Question Answering | Jun 1, 2016 | Object Recognition, Prediction | —Unverified | 0 | 0 |
| KOSMOS-2.5: A Multimodal Literate Model | Sep 20, 2023 | document understanding, model | —Unverified | 0 | 0 |
| From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector Graphics | Mar 10, 2025 | Math, Question Answering | —Unverified | 0 | 0 |
| Unsupervised Keyword Extraction for Full-sentence VQA | Nov 23, 2019 | Keyword Extraction, Question Answering | —Unverified | 0 | 0 |
| Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | May 31, 2023 | counterfactual, Counterfactual Inference | —Unverified | 0 | 0 |
| KVQA: Knowledge-Aware Visual Question Answering | Jul 17, 2019 | Knowledge Graphs, Question Answering | —Unverified | 0 | 0 |
| From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason | Oct 1, 2019 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering | Jun 25, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |