| How Modular Should Neural Module Networks Be for Systematic Generalization? | Jun 15, 2021 | Question Answering, Systematic Generalization | Code Available | 0 |
| Targeted Visual Prompting for Medical Visual Question Answering | Aug 6, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| Self Supervision for Attention Networks | Jan 6, 2021 | image-classification, Image Classification | Code Available | 0 |
| VQA Therapy: Exploring Answer Differences by Visually Grounding Answers | Aug 21, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| UMIT: Unifying Medical Imaging Tasks via Vision-Language Models | Mar 20, 2025 | Diagnostic, Medical Image Analysis | Code Available | 0 |
| Semantically Equivalent Adversarial Rules for Debugging NLP models | Jul 1, 2018 | Data Augmentation, Question Answering | Code Available | 0 |
| Alignment Attention by Matching Key and Query Distributions | Oct 25, 2021 | Graph Attention, Question Answering | Code Available | 0 |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Oct 17, 2023 | Attribute, Question Answering | Code Available | 0 |
| Deep Modular Co-Attention Networks for Visual Question Answering | Jun 25, 2019 | Question Answering, Visual Question Answering | Code Available | 0 |
| High-Order Attention Models for Visual Question Answering | Nov 12, 2017 | Question Answering, Visual Question Answering | Code Available | 0 |
| 12-in-1: Multi-Task Vision and Language Representation Learning | Dec 5, 2019 | 10-shot image generation, Image Retrieval | Code Available | 0 |
| Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations | May 15, 2019 | Image Captioning, Question Answering | Code Available | 0 |
| Separate and Locate: Rethink the Text in Text-based Visual Question Answering | Aug 31, 2023 | Optical Character Recognition (OCR), Position | Code Available | 0 |
| Hierarchical Deep Multi-modal Network for Medical Visual Question Answering | Sep 27, 2020 | Descriptive, Medical Visual Question Answering | Code Available | 0 |
| Visual Question Answering: Datasets, Algorithms, and Future Challenges | Oct 5, 2016 | Question Answering, Visual Question Answering | Code Available | 0 |
| Are VLMs Really Blind | Oct 29, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests | Dec 3, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| ShapeWorld - A new test methodology for multimodal language understanding | Apr 14, 2017 | Multimodal Deep Learning, Visual Question Answering | Code Available | 0 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Nov 21, 2023 | Descriptive, MME | Code Available | 0 |
| Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? | Oct 17, 2024 | All, Language Modeling | Code Available | 0 |
| Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog | Oct 11, 2023 | Question Answering, Response Generation | Code Available | 0 |
| Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering | Apr 11, 2017 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | May 28, 2023 | Machine Translation, Multimodal Machine Translation | Code Available | 0 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 |
| Uncovering the Full Potential of Visual Grounding Methods in VQA | Jan 15, 2024 | Question Answering, Visual Grounding | Code Available | 0 |
| Siamese Tracking with Lingual Object Constraints | Nov 23, 2020 | Object, Object Tracking | Code Available | 0 |
| World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering | Sep 30, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation | Aug 15, 2017 | Language Modeling, Language Modelling | Code Available | 0 |
| SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Dec 21, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| Sim2Real Transfer for Vision-Based Grasp Verification | May 5, 2025 | Object, object-detection | Code Available | 0 |
| Hallucination Benchmark in Medical Visual Question Answering | Jan 11, 2024 | Hallucination, Medical Visual Question Answering | Code Available | 0 |
| Simple Baseline for Visual Question Answering | Dec 7, 2015 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Jun 12, 2025 | Hallucination, Image Captioning | Code Available | 0 |
| Understanding Attention for Vision-and-Language Tasks | Aug 17, 2022 | Image Generation, Image Retrieval | Code Available | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| A Question-Centric Model for Visual Question Answering in Medical Imaging | Mar 2, 2020 | Medical Image Analysis, Question Answering | Code Available | 0 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 0 |
| HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains | Jun 9, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| Applying recent advances in Visual Question Answering to Record Linkage | Jul 12, 2020 | Question Answering, Visual Question Answering | Code Available | 0 |
| Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering | Nov 17, 2024 | Hallucination, In-Context Learning | Code Available | 0 |
| Single-Stream Multi-Level Alignment for Vision-Language Pretraining | Mar 27, 2022 | Image-text Retrieval, Question Answering | Code Available | 0 |
| VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning | Mar 5, 2023 | Answer Generation, Entity Alignment | Code Available | 0 |
| Hadamard Product for Low-rank Bilinear Pooling | Oct 14, 2016 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types | Sep 14, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Grounding Answers for Visual Questions Asked by Visually Impaired People | Feb 4, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Grad-CAM: Why did you say that? | Nov 22, 2016 | Image Captioning, Visual Question Answering | Code Available | 0 |
| Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model | Jan 12, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| SlotPi: Physics-informed Object-centric Reasoning Models | Jun 12, 2025 | Object, Question Answering | Code Available | 0 |
| Understanding the World's Museums through Vision-Language Reasoning | Dec 2, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Oct 12, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |