Hierarchical Conditional Relation Networks for Video Question Answering (Feb 25, 2020). Tags: Audio-Visual Question Answering (AVQA), Question Answering
Rethinking Data Augmentation for Robust Visual Question Answering (Jul 18, 2022). Tags: Data Augmentation, Knowledge Distillation
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs (May 29, 2024). Tags: Image Retrieval, Question Answering
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering (Jun 2, 2022). Tags: Question Answering, Retrieval
3D-Aware Visual Question Answering about Parts, Poses and Occlusions (Oct 27, 2023). Tags: Question Answering, Visual Question Answering
Deep Multimodal Neural Architecture Search (Apr 25, 2020). Tags: Decoder, Image-text matching
GRIT: General Robust Image Task Benchmark (Apr 28, 2022). Tags: Instance Segmentation, Keypoint Detection
Can I Trust Your Answer? Visually Grounded Video Question Answering (Sep 4, 2023). Tags: Grounded Video Question Answering, Question Answering
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset (Nov 5, 2024). Tags: Benchmarking, Language Modeling
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models (Jul 9, 2023). Tags: Question Answering, TGIF-Frame
Scalable Neural-Probabilistic Answer Set Programming (Jun 14, 2023). Tags: Probabilistic Programming, Question Answering
Describe Anything Model for Visual Question Answering on Text-rich Images (Jul 16, 2025). Tags: Descriptive, Language Modeling
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering (Nov 13, 2023). Tags: Decision Making, Explanation Generation
Searching the Search Space of Vision Transformer (Nov 29, 2021). Tags: Neural Architecture Search, object-detection
Graph Optimal Transport for Cross-Domain Alignment (Jun 26, 2020). Tags: Graph Matching, Image Captioning
Greedy Gradient Ensemble for Robust Visual Question Answering (Jul 27, 2021). Tags: Question Answering, Visual Question Answering
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles (Dec 18, 2023). Tags: Question Answering, Visual Question Answering
Hierarchical multimodal transformers for Multi-Page DocVQA (Dec 7, 2022). Tags: Decoder, Question Answering
DeVLBert: Learning Deconfounded Visio-Linguistic Representations (Aug 16, 2020). Tags: Image Retrieval, Question Answering
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering (Jul 19, 2020). Tags: Adversarial Attack, Data Augmentation
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA (Sep 10, 2021). Tags: Image Captioning, Question Answering
Skipping Computations in Multimodal LLMs (Oct 12, 2024). Tags: Question Answering, Visual Question Answering
SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images (Jan 12, 2023). Tags: Evidence Selection, Question Answering
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models (Oct 12, 2022). Tags: Object, Question Answering
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images (Oct 1, 2021). Tags: Question Answering, Visual Question Answering
Spatially Aware Multimodal Transformers for TextVQA (Jul 23, 2020). Tags: Optical Character Recognition (OCR), Spatial Reasoning
An Empirical Study of Multimodal Model Merging (Apr 28, 2023). Tags: model, Retrieval
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Feb 25, 2019). Tags: Question Answering, Visual Question Answering (VQA)
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (Oct 7, 2016). Tags: General Classification, Image Attribution
Faithful Multimodal Explanation for Visual Question Answering (Sep 8, 2018). Tags: Explanatory Visual Question Answering, Question Answering
An Empirical Study of Training End-to-End Vision-and-Language Transformers (Nov 3, 2021). Tags: Cross-Modal Retrieval, Decoder
FiLM: Visual Reasoning with a General Conditioning Layer (Sep 22, 2017). Tags: Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA)
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs (Mar 27, 2025). Tags: Attribute, Benchmarking
Disentangling 3D Prototypical Networks For Few-Shot Concept Learning (Nov 6, 2020). Tags: 3D geometry, 3D Object Detection
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering (Jun 29, 2023). Tags: Answer Generation, Question Answering
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering (Dec 14, 2021). Tags: Graph Matching, Question Answering
GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering (Apr 20, 2021). Tags: Graph Neural Network, Graph Question Answering
Distilled Dual-Encoder Model for Vision-Language Understanding (Dec 16, 2021). Tags: Image to text, model
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution (May 27, 2025). Tags: 8k, Avg
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (Jan 6, 2024). Tags: Scene Understanding, Visual Question Answering (VQA)
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition (Sep 29, 2024). Tags: In-Context Learning, Question Answering
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Feb 18, 2021). Tags: Decoder, Document Image Classification
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering (Jul 13, 2021). Tags: Navigate, Question Answering
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption (Dec 8, 2020). Tags: Caption Generation, Language Modeling
Hierarchical Question-Image Co-Attention for Visual Question Answering (May 31, 2016). Tags: Visual Dialog, Visual Question Answering
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding (Apr 15, 2024). Tags: Question Answering, Visual Question Answering (VQA)
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? (Jan 5, 2025). Tags: Image Captioning, Image to text
DocVQA: A Dataset for VQA on Document Images (Jul 1, 2020). Tags: Question Answering, Reading Comprehension
Think Locally, Act Globally: Federated Learning with Local and Global Representations (Jan 6, 2020). Tags: Federated Learning, Representation Learning
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA (Feb 24, 2024). Tags: 3D Question Answering (3D-QA), Question Answering