Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training May 24, 2021 Image Captioning Medical Visual Question Answering
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions Dec 8, 2020 counterfactual Descriptive
HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment Nov 18, 2023 Video Quality Assessment Visual Question Answering (VQA)
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs Oct 15, 2020 Language Modeling Language Modelling
A Dataset and Baselines for Visual Question Answering on Art Aug 28, 2020 Question Answering Question Generation
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers May 27, 2023 Image Captioning Image Retrieval
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Nov 27, 2023 Adversarial Robustness Visual Question Answering (VQA)
AMD-Hummingbird: Towards an Efficient Text-to-Video Model Mar 24, 2025 Computational Efficiency Video Generation
GRIT: General Robust Image Task Benchmark Apr 28, 2022 Instance Segmentation Keypoint Detection
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering Jan 1, 2025 Large Language Model Multimodal Large Language Model
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles Dec 18, 2023 Question Answering Visual Question Answering
Detecting Hate Speech in Multi-modal Memes Dec 29, 2020 Binary Classification Hate Speech Detection
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge May 31, 2019 object-detection Object Detection
NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations Dec 11, 2023 Autonomous Driving Descriptive
DeVLBert: Learning Deconfounded Visio-Linguistic Representations Aug 16, 2020 Image Retrieval Question Answering
HallE-Control: Controlling Object Hallucination in Large Multimodal Models Oct 3, 2023 Attribute Decoder
Ontology-guided Semantic Composition for Zero-Shot Learning Jun 30, 2020 image-classification Image Classification
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space Mar 14, 2025 Language Modeling Language Modelling
How Much Can CLIP Benefit Vision-and-Language Tasks? Jul 13, 2021 Question Answering Vision and Language Navigation
In Defense of Grid Features for Visual Question Answering Jan 10, 2020 Image Captioning Question Answering
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering Feb 25, 2019 Question Answering Visual Question Answering (VQA)
Overcoming Language Priors with Self-supervised Learning for Visual Question Answering Dec 17, 2020 Question Answering Self-Supervised Learning
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Oct 7, 2016 General Classification Image Attribution
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Feb 23, 2023 Open-Domain Question Answering Question Answering
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge Jun 3, 2022 Question Answering Visual Question Answering
GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering Apr 20, 2021 Graph Neural Network Graph Question Answering
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos Oct 11, 2021 Audio-visual Question Answering Question Answering
ParlAI: A Dialog Research Software Platform May 18, 2017 reinforcement-learning Reinforcement Learning
Passage Retrieval for Outside-Knowledge Visual Question Answering May 9, 2021 Image Captioning Object
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models May 23, 2022 Language Modeling Language Modelling
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback Oct 8, 2024 Math Sequential Decision Making
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer Feb 18, 2021 Decoder Document Image Classification
End-to-end Document Recognition and Understanding with Dessurt Mar 30, 2022 document understanding Visual Question Answering (VQA)
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers Apr 2, 2020 Image-text matching Image-text Retrieval
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering Jul 13, 2021 Navigate Question Answering
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images Oct 1, 2021 Question Answering Visual Question Answering
Generative Bias for Robust Visual Question Answering Aug 1, 2022 Knowledge Distillation Question Answering
Probing Image-Language Transformers for Verb Understanding Jun 16, 2021 Image Retrieval Question Answering
Debiased Visual Question Answering from Feature and Sample Perspectives Dec 1, 2021 Bias Detection Question Answering
Debiasing Multimodal Models via Causal Information Minimization Nov 28, 2023 Visual Question Answering (VQA)
Declaration-based Prompt Tuning for Visual Question Answering May 5, 2022 Image-text matching Language Modeling
GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph Sep 6, 2021 Graph Generation Graph Learning
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering Jun 29, 2023 Answer Generation Question Answering
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers Mar 29, 2021 Decoder Image Segmentation
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder Jun 28, 2025 Image Segmentation Large Language Model
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis Jun 28, 2024 Visual Question Answering (VQA) Visual Reasoning