Faithful Multimodal Explanation for Visual Question Answering Sep 8, 2018 Explanatory Visual Question Answering Question Answering
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions Dec 8, 2020 counterfactual Descriptive
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA Jun 30, 2022 Question Answering Retrieval
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Nov 27, 2023 Adversarial Robustness Visual Question Answering (VQA)
A Dataset and Baselines for Visual Question Answering on Art Aug 28, 2020 Question Answering Question Generation
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers May 27, 2023 Image Captioning Image Retrieval
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling Nov 23, 2021 Image Captioning Image Description
AMD-Hummingbird: Towards an Efficient Text-to-Video Model Mar 24, 2025 Computational Efficiency Video Generation
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering Jul 26, 2022 Causal Inference Question Answering
Dynamic Language Binding in Relational Visual Reasoning Apr 30, 2020 Object Question Answering
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge May 31, 2019 object-detection Object Detection
LXMERT: Learning Cross-Modality Encoder Representations from Transformers Aug 20, 2019 Language Modeling Language Modelling
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Jan 6, 2025 Language Model Evaluation Language Modeling
Panoramic Vision Transformer for Saliency Detection in 360° Videos Sep 19, 2022 Saliency Detection Saliency Prediction
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting Oct 13, 2022 Image Captioning Question Answering
In Defense of Grid Features for Visual Question Answering Jan 10, 2020 Image Captioning Question Answering
Dual-Key Multimodal Backdoors for Visual Question Answering Dec 14, 2021 Question Answering Visual Question Answering
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning Oct 25, 2021 Arithmetic Reasoning Mathematical Question Answering
Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models Jul 26, 2023 Image Quality Assessment No-Reference Image Quality Assessment
Localized Questions in Medical Visual Question Answering Jul 3, 2023 Medical Visual Question Answering Question Answering
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models Mar 23, 2024 Common Sense Reasoning In-Context Learning
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models May 23, 2022 Language Modeling Language Modelling
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Feb 23, 2023 Open-Domain Question Answering Question Answering
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge Jun 3, 2022 Question Answering Visual Question Answering
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering May 17, 2023 Benchmarking Diagnostic
LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering Nov 21, 2020 Answer Generation Question Answering
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts Nov 16, 2024 Mixture-of-Experts Optical Character Recognition (OCR)
Improving Selective Visual Question Answering by Learning from Your Peers Jun 14, 2023 Question Answering Visual Question Answering
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
Prismer: A Vision-Language Model with Multi-Task Experts Mar 4, 2023 Few-Shot Learning Image Captioning
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering Jul 10, 2021 Graph Attention Question Answering
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images Oct 1, 2021 Question Answering Visual Question Answering
Does Vision-and-Language Pretraining Improve Lexical Grounding? Sep 21, 2021 Question Answering Visual Question Answering
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback Oct 8, 2024 Math Sequential Decision Making
DocVQA: A Dataset for VQA on Document Images Jul 1, 2020 Question Answering Reading Comprehension
ProTo: Program-Guided Transformer for Program-Guided Tasks Oct 2, 2021 Decision Making Learning to Execute
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering Jun 29, 2023 Answer Generation Question Answering
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Oct 14, 2024 Visual Question Answering (VQA) World Knowledge
Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement May 16, 2023 Video Enhancement Video Quality Assessment
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Dec 21, 2023 Image Retrieval Image-to-Text Retrieval
Debiased Visual Question Answering from Feature and Sample Perspectives Dec 1, 2021 Bias Detection Question Answering
Debiasing Multimodal Models via Causal Information Minimization Nov 28, 2023 Visual Question Answering (VQA)
Declaration-based Prompt Tuning for Visual Question Answering May 5, 2022 Image-text matching Language Modeling
DocFormerv2: Local Features for Document Understanding Jun 2, 2023 Decoder document understanding
Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance May 6, 2024 Exposure Correction Video Enhancement
Visual Grounding Methods for VQA are Working for the Wrong Reasons! Apr 12, 2020 Question Answering Visual Grounding
ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment Jul 16, 2024 Optical Flow Estimation Video Compression
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder Jun 28, 2025 Image Segmentation Large Language Model
Distilled Dual-Encoder Model for Vision-Language Understanding Dec 16, 2021 Image to text model