NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks Mar 9, 2022 Decision Making Explainable artificial intelligence
Generative Bias for Robust Visual Question Answering Aug 1, 2022 Knowledge Distillation Question Answering
Are Bias Mitigation Techniques for Deep Learning Effective? Apr 1, 2021 Deep Learning Question Answering
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
End-to-end Document Recognition and Understanding with Dessurt Mar 30, 2022 document understanding Visual Question Answering (VQA)
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator Dec 11, 2023 Image Captioning Question Answering
End-to-end Knowledge Retrieval with Multi-modal Queries Jun 1, 2023 Benchmarking Cross-Modal Retrieval
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images Oct 1, 2021 Question Answering Visual Question Answering
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering Jan 1, 2025 Large Language Model Multimodal Large Language Model
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Jul 25, 2017 Image Captioning Visual Question Answering
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering Jul 10, 2021 Graph Attention Question Answering
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer Feb 18, 2021 Decoder Document Image Classification
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs Mar 27, 2025 Attribute Benchmarking
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs Oct 15, 2020 Language Modeling Language Modelling
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Oct 7, 2016 General Classification Image Attribution
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions May 18, 2021 Question Answering Video Question Answering
Graph Optimal Transport for Cross-Domain Alignment Jun 26, 2020 Graph Matching Image Captioning
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Feb 23, 2023 Open-Domain Question Answering Question Answering
GRIT: General Robust Image Task Benchmark Apr 28, 2022 Instance Segmentation Keypoint Detection
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner May 19, 2023 Dense Captioning Image Captioning
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts Apr 12, 2024 Image Captioning Question Answering
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training Nov 23, 2023 Multimodal Reasoning Science Question Answering
Dual-Key Multimodal Backdoors for Visual Question Answering Dec 14, 2021 Question Answering Visual Question Answering
eP-ALM: Efficient Perceptual Augmentation of Language Models Mar 20, 2023 In-Context Learning Visual Question Answering (VQA)
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge May 31, 2019 object-detection Object Detection
Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion Feb 26, 2023 Video Quality Assessment Visual Question Answering (VQA)
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding Oct 12, 2022 document-image-classification Document Image Classification
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions Jun 19, 2021 Question Answering Video Question Answering
Hierarchical multimodal transformers for Multi-Page DocVQA Dec 7, 2022 Decoder Question Answering
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection Nov 5, 2023 Anomaly Detection Question Answering
Hierarchical Conditional Relation Networks for Video Question Answering Feb 25, 2020 Audio-Visual Question Answering (AVQA) Question Answering
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant Aug 19, 2024 Descriptive Face Swapping
Introspective Distillation for Robust Question Answering Nov 1, 2021 counterfactual Inductive Bias
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Nov 27, 2023 Adversarial Robustness Visual Question Answering (VQA)
Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering Sep 19, 2024 Hallucination Hallucination Evaluation
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering Sep 18, 2020 Out-of-Distribution Generalization Question Answering
Overcoming Language Priors with Self-supervised Learning for Visual Question Answering Dec 17, 2020 Question Answering Self-Supervised Learning
How to Configure Good In-Context Sequence for Visual Question Answering Dec 4, 2023 In-Context Learning Question Answering
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA Dec 21, 2023 Contrastive Learning counterfactual
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images Jun 26, 2025 document understanding Optical Character Recognition (OCR)
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection Jan 31, 2019 Question Answering Relationship Detection
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering Dec 1, 2017 Question Answering Visual Question Answering
Multimodal Residual Learning for Visual QA Jun 5, 2016 Multiple-choice Question Answering
Blind VQA on 360° Video via Progressively Learning from Pixels, Frames and Video Nov 18, 2021 Visual Question Answering (VQA)
Blind Prediction of Natural Video Quality Jan 9, 2014 Prediction Video Quality Assessment
A Neuro-Symbolic ASP Pipeline for Visual Question Answering May 16, 2022 Question Answering Visual Question Answering
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence Feb 15, 2018 Activity Recognition Explainable Models
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents Nov 23, 2024 Question Answering RAG
Biomedical Visual Instruction Tuning with Clinician Preference Alignment Jun 19, 2024 Instruction Following Visual Question Answering (VQA)