| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object, Object Counting | Code Available | 1 | 5 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 | 5 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection, Question Answering | Code Available | 1 | 5 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 | 5 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Jan 1, 2021 | Audio-visual Question Answering, Question Answering | Code Available | 1 | 5 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 | 5 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Dec 5, 2020 | image-classification, Image Classification | Code Available | 1 | 5 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Oct 13, 2022 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question Answering, Question Generation | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 | 5 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Feb 11, 2024 | Language Modeling, Open Vocabulary Attribute Detection | Code Available | 1 | 5 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Oct 11, 2021 | Audio-visual Question Answering, Question Answering | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 | 5 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Jun 28, 2024 | Medical Diagnosis, Medical Question Answering | Code Available | 1 | 5 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 | 5 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Jul 17, 2019 | Image Captioning, Multi-Task Learning | Code Available | 0 | 5 |
| DVQA: Understanding Data Visualizations via Question Answering | Jan 24, 2018 | Articles, Chart Question Answering | Code Available | 0 | 5 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 | 5 |
| DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue | Nov 17, 2019 | feature selection, Question Answering | Code Available | 0 | 5 |
| Dual Recurrent Attention Units for Visual Question Answering | Feb 1, 2018 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Bridging Vision and Language Spaces with Assignment Prediction | Apr 15, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 0 | 5 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 0 | 5 |
| OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics | Feb 21, 2022 | BIG-bench Machine Learning, Graph Generation | Code Available | 0 | 5 |
| On Modality Bias Recognition and Reduction | Feb 25, 2022 | Action Recognition, Multi-modal Classification | Code Available | 0 | 5 |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Feb 25, 2019 | AI Agent, Question Answering | Code Available | 0 | 5 |
| Dual Attention Networks for Multimodal Reasoning and Matching | Nov 2, 2016 | Collaborative Inference, Image-text matching | Code Available | 0 | 5 |
| Object Attribute Matters in Visual Question Answering | Dec 20, 2023 | Attribute, Graph Neural Network | Code Available | 0 | 5 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 0 | 5 |
| Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering | Dec 20, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 | 5 |
| Towards Flexible Evaluation for Generative Visual Question Answering | Aug 1, 2024 | Decoder, Generative Visual Question Answering | Code Available | 0 | 5 |
| Answer Them All! Toward Universal Visual Question Answering Models | Mar 1, 2019 | All, Question Answering | Code Available | 0 | 5 |
| Neural Module Networks | Nov 9, 2015 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding | Oct 4, 2018 | Question Answering, Representation Learning | Code Available | 0 | 5 |
| Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering | Dec 1, 2017 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Feb 3, 2021 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering | Apr 22, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Dec 20, 2024 | Compositional Generalization (AVG), Novel Concepts | Code Available | 0 | 5 |
| No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory | Feb 6, 2025 | Continual Learning, Question Answering | Code Available | 0 | 5 |
| MUTAN: Multimodal Tucker Fusion for Visual Question Answering | May 18, 2017 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Mar 21, 2025 | Attribute, Mathematical Problem-Solving | Code Available | 0 | 5 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | May 27, 2025 | Audio-visual Question Answering, Question Answering | Code Available | 0 | 5 |
| Multi-Sourced Compositional Generalization in Visual Question Answering | May 29, 2025 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |