| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive Learning, Hallucination | Code Available | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 |
| Foundation Model is Efficient Multimodal Multitask Model Selector | Aug 11, 2023 | model, Model Selection | Code Available | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Oct 18, 2021 | Descriptive, named-entity-recognition | Code Available | 1 |
| MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | May 26, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering | Mar 21, 2024 | object-detection, Object Detection | Code Available | 1 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering | Jun 1, 2019 | Question Answering, Visual Question Answering | Unverified | 0 |
| Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering | Dec 13, 2018 | Question Answering, Visual Question Answering | Unverified | 0 |