| Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Oct 31, 2023 | Image Captioning, Language Modeling | Code Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| MISS: A Generative Pretraining and Finetuning Approach for Med-VQA | Jan 10, 2024 | Medical Visual Question Answering, Multi-Task Learning | Code Available | 1 |
| In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering | Code Available | 1 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Dec 20, 2016 | Diagnostic, Question Answering | Code Available | 1 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Sep 10, 2021 | Image Captioning, Question Answering | Code Available | 1 |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Disentangling 3D Prototypical Networks For Few-Shot Concept Learning | Nov 6, 2020 | 3D geometry, 3D Object Detection | Code Available | 1 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning | Code Available | 1 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations, Question Answering | Code Available | 1 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Oct 23, 2024 | Image Captioning, Instruction Following | Code Available | 1 |
| Multi-Modal Answer Validation for Knowledge-Based VQA | Mar 23, 2021 | Question Answering, Retrieval | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | Diagnostic, Language Modeling | Code Available | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 |
| Debiased Visual Question Answering from Feature and Sample Perspectives | Dec 1, 2021 | Bias Detection, Question Answering | Code Available | 1 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Oct 25, 2021 | Arithmetic Reasoning, Mathematical Question Answering | Code Available | 1 |
| Declaration-based Prompt Tuning for Visual Question Answering | May 5, 2022 | Image-text matching, Language Modeling | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making, Medical Visual Question Answering | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training | May 24, 2021 | Image Captioning, Medical Visual Question Answering | Code Available | 1 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering | Sep 18, 2020 | Out-of-Distribution Generalization, Question Answering | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | May 31, 2023 | Cross-Modal Retrieval, Question Answering | Code Available | 1 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Jun 8, 2021 | Question Answering, Visual Entailment | Code Available | 1 |
| DeVLBert: Learning Deconfounded Visio-Linguistic Representations | Aug 16, 2020 | Image Retrieval, Question Answering | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |