Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers Mar 29, 2021 Decoder Image Segmentation
GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph Sep 6, 2021 Graph Generation Graph Learning
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer Feb 18, 2021 Decoder Document Image Classification
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator Dec 11, 2023 Image Captioning Question Answering
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis Jun 28, 2024 Visual Question Answering (VQA) Visual Reasoning
FunQA: Towards Surprising Video Comprehension Jun 26, 2023 Question Answering Text Generation
Graph Optimal Transport for Cross-Domain Alignment Jun 26, 2020 Graph Matching Image Captioning
2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos Aug 31, 2022 Video Quality Assessment Visual Question Answering (VQA)
GRIT: General Robust Image Task Benchmark Apr 28, 2022 Instance Segmentation Keypoint Detection
HallE-Control: Controlling Object Hallucination in Large Multimodal Models Oct 3, 2023 Attribute Decoder
Attention in Reasoning: Dataset, Analysis, and Modeling Apr 20, 2022 Question Answering Visual Question Answering
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts Nov 16, 2024 Mixture-of-Experts Optical Character Recognition (OCR)
HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment Nov 18, 2023 Video Quality Assessment Visual Question Answering (VQA)
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Nov 27, 2023 Adversarial Robustness Visual Question Answering (VQA)
How Much Can CLIP Benefit Vision-and-Language Tasks? Jul 13, 2021 Question Answering Vision and Language Navigation
BadCM: Invisible Backdoor Attack Against Cross-Modal Learning Oct 3, 2024 Backdoor Attack Cross-Modal Retrieval
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations Feb 10, 2024 Diagnostic Hallucination
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision Nov 17, 2022 Image Captioning Question Answering
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning Oct 25, 2021 Arithmetic Reasoning Mathematical Question Answering
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering Apr 20, 2021 Graph Neural Network Graph Question Answering
Improving Selective Visual Question Answering by Learning from Your Peers Jun 14, 2023 Question Answering Visual Question Answering
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports Sep 3, 2020 Image-text Retrieval Medical Visual Question Answering
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models Nov 9, 2024 object-detection Object Detection
Learning to Answer Visual Questions from Web Videos May 10, 2022 Dataset Generation Question Answering
3D-Aware Visual Question Answering about Parts, Poses and Occlusions Oct 27, 2023 Question Answering Visual Question Answering
Attention-Based Context Aware Reasoning for Situation Recognition Jun 1, 2020 Action Recognition Fine-grained Action Recognition
FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding Dec 5, 2020 image-classification Image Classification
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning May 10, 2021 Arithmetic Reasoning Geometry Problem Solving
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering Nov 13, 2023 Decision Making Explanation Generation
Change Detection Meets Visual Question Answering Dec 12, 2021 Answer Generation Change Detection
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling Sep 4, 2022 Fill Mask Optical Flow Estimation
Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features Jan 14, 2020 Classification Diversity
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA Sep 10, 2021 Image Captioning Question Answering
Just Ask: Learning to Answer Questions from Millions of Narrated Videos Dec 1, 2020 Question Answering Question Generation
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering Apr 7, 2021 Question Answering Visual Question Answering
An Empirical Study of Multimodal Model Merging Apr 28, 2023 model Retrieval
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning Oct 1, 2023 In-Context Learning Instruction Following
An Empirical Study of Training End-to-End Vision-and-Language Transformers Nov 3, 2021 Cross-Modal Retrieval Decoder
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering Dec 14, 2021 Graph Matching Question Answering
Florence: A New Foundation Model for Computer Vision Nov 22, 2021 Action Classification Action Recognition
FAVER: Blind Quality Prediction of Variable Frame Rate Videos Jan 5, 2022 Cloud Computing Video Quality Assessment
Language-Informed Visual Concept Learning Dec 6, 2023 Disentanglement Novel Concepts
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
Check It Again: Progressive Visual Question Answering via Visual Entailment Aug 1, 2021 Question Answering Visual Entailment
Blindly Assess Quality of In-the-Wild Videos via Quality-aware Pre-training and Motion Perception Aug 19, 2021 Action Recognition Image Quality Assessment
Check It Again: Progressive Visual Question Answering via Visual Entailment Jun 8, 2021 Question Answering Visual Entailment
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering Apr 26, 2023 Decoder Knowledge Distillation