| Query and Attention Augmentation for Knowledge-Based Explainable Reasoning | Jan 1, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Dynamic Memory Networks for Visual and Textual Question Answering | Mar 4, 2016 | Question Answering, Visual Question Answering | Code Available | 0 |
| LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Apr 18, 2022 | cross-modal alignment, Document AI | Code Available | 0 |
| LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | Dec 29, 2020 | Document Image Classification, Document Layout Analysis | Code Available | 0 |
| An Improved Attention for Visual Question Answering | Nov 4, 2020 | Decoder, Question Answering | Code Available | 0 |
| TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models | Aug 7, 2023 | backdoor defense, object-detection | Code Available | 0 |
| TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | May 21, 2025 | Human Aging, Question Answering | Code Available | 0 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |
| Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering | Mar 6, 2022 | Graph Attention, Question Answering | Code Available | 0 |
| DVQA: Understanding Data Visualizations via Question Answering | Jan 24, 2018 | Articles, Chart Question Answering | Code Available | 0 |
| CLEVR\_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images | Jun 1, 2021 | Question Answering, Visual Question Answering | Code Available | 0 |
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | Aug 9, 2017 | GPU, Visual Question Answering | Code Available | 0 |
| CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images | Apr 13, 2021 | Question Answering, Visual Question Answering | Code Available | 0 |
| CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning | Nov 26, 2018 | Acoustic Question Answering, Question Answering | Code Available | 0 |
| QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View | Jul 18, 2024 | Action Anticipation, Action Recognition | Code Available | 0 |
| Latent Alignment and Variational Attention | Jul 10, 2018 | Hard Attention, Machine Translation | Code Available | 0 |
| Large Models in Dialogue for Active Perception and Anomaly Detection | Jan 27, 2025 | Anomaly Detection, Question Answering | Code Available | 0 |
| Large Language Models Understand Layout | Jul 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Jun 11, 2025 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue | Nov 17, 2019 | feature selection, Question Answering | Code Available | 0 |
| Dual Recurrent Attention Units for Visual Question Answering | Feb 1, 2018 | Question Answering, Visual Question Answering | Code Available | 0 |
| Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA | Jun 18, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Feb 25, 2019 | AI Agent, Question Answering | Code Available | 0 |
| RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | May 20, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Sep 17, 2024 | cross-modal alignment, Question Answering | Code Available | 0 |