| Analyzing the Behavior of Visual Question Answering Models | Jun 23, 2016 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts | Oct 20, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| NAAQA: A Neural Architecture for Acoustic Question Answering | Jun 11, 2021 | Acoustic Question Answering, Question Answering | Code Available | 0 | 5 |
| NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Dec 20, 2024 | Compositional Generalization (AVG), Novel Concepts | Code Available | 0 | 5 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | May 27, 2025 | Audio-visual Question Answering, Question Answering | Code Available | 0 | 5 |
| MUTAN: Multimodal Tucker Fusion for Visual Question Answering | May 18, 2017 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| 12-in-1: Multi-Task Vision and Language Representation Learning | Dec 5, 2019 | 10-shot image generation, Image Retrieval | Code Available | 0 | 5 |
| Neural Module Networks | Nov 9, 2015 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Open-Set Knowledge-Based Visual Question Answering with Inference Paths | Oct 12, 2023 | Knowledge Graphs, Multi-class Classification | Code Available | 0 | 5 |
| Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA | Mar 17, 2021 | Question Answering, Relational Reasoning | Code Available | 0 | 5 |
| Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism | Apr 29, 2024 | document understanding, GPU | Code Available | 0 | 5 |
| Multimodal Residual Learning for Visual QA | Jun 5, 2016 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering | Sep 23, 2020 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Counting Everyday Objects in Everyday Scenes | Apr 12, 2016 | Object, Object Counting | Code Available | 0 | 5 |
| AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? | Oct 28, 2024 | Benchmarking, Question Answering | Code Available | 0 | 5 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 | 5 |
| A Unified Hallucination Mitigation Framework for Large Vision-Language Models | Sep 24, 2024 | Hallucination, Question Answering | Code Available | 0 | 5 |
| Core Tokensets for Data-efficient Sequential Training of Transformers | Oct 8, 2024 | Image Captioning, image-classification | Code Available | 0 | 5 |
| Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Oct 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering | Aug 4, 2017 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Grad-CAM: Why did you say that? | Nov 22, 2016 | Image Captioning, Visual Question Answering | Code Available | 0 | 5 |
| Convincing Rationales for Visual Question Answering Reasoning | Feb 6, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Jun 6, 2016 | Phrase Grounding, Visual Grounding | Code Available | 0 | 5 |
| Continual VQA for Disaster Response Systems | Sep 21, 2022 | Disaster Response, Management | Code Available | 0 | 5 |