YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos Apr 12, 2020 Action Understanding Question Answering
Code Available 1 Evaluating Multimodal Representations on Visual Semantic Textual Similarity Apr 4, 2020 Benchmarking Image Captioning
Code Available 1 Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers Apr 2, 2020 Image-text matching Image-text Retrieval
Code Available 1 Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text Mar 31, 2020 Graph Neural Network Question Answering
Code Available 1 X-Linear Attention Networks for Image Captioning Mar 31, 2020 Decoder Fine-Grained Visual Recognition
Code Available 1 Ground Truth Evaluation of Neural Network Explanations with CLEVR-XAI Mar 16, 2020 Benchmarking Explainable Artificial Intelligence (XAI)
Code Available 1 Counterfactual Samples Synthesizing for Robust Visual Question Answering Mar 14, 2020 counterfactual Question Answering
Code Available 1 PathVQA: 30000+ Questions for Medical Visual Question Answering Mar 7, 2020 AI Agent Medical Visual Question Answering
Code Available 1 Visual Commonsense R-CNN Feb 27, 2020 Image Captioning Representation Learning
Code Available 1 Hierarchical Conditional Relation Networks for Video Question Answering Feb 25, 2020 Audio-Visual Question Answering (AVQA) Question Answering
Code Available 1 Multimodal fusion of imaging and genomics for lung cancer recurrence prediction Feb 5, 2020 Computed Tomography (CT) Question Answering
Code Available 1 Break It Down: A Question Understanding Benchmark Jan 31, 2020 Open-Domain Question Answering Question Answering
Code Available 1 Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features Jan 14, 2020 Classification Diversity
Code Available 1 In Defense of Grid Features for Visual Question Answering Jan 10, 2020 Image Captioning Question Answering
Code Available 1 Think Locally, Act Globally: Federated Learning with Local and Global Representations Jan 6, 2020 Federated Learning Representation Learning
Code Available 1 Overcoming Data Limitation in Medical Visual Question Answering Sep 26, 2019 Denoising Medical Visual Question Answering
Code Available 1 UNITER: UNiversal Image-TExt Representation Learning Sep 25, 2019 Image-text matching Image-text Retrieval
Code Available 1 Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases Sep 9, 2019 Natural Language Inference Question Answering
Code Available 1 VL-BERT: Pre-training of Generic Visual-Linguistic Representations Aug 22, 2019 Image-text matching Language Modelling
Code Available 1 LXMERT: Learning Cross-Modality Encoder Representations from Transformers Aug 20, 2019 Language Modeling Language Modelling
Code Available 1 VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering Aug 14, 2019 Embodied Question Answering Question Answering
Code Available 1 VisualBERT: A Simple and Performant Baseline for Vision and Language Aug 9, 2019 Language Modeling Language Modelling
Code Available 1 ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks Aug 6, 2019 Image Retrieval Question Answering
Code Available 1 OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge May 31, 2019 object-detection Object Detection
Code Available 1 Scene Text Visual Question Answering May 31, 2019 Question Answering Visual Question Answering
Code Available 1 GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering Feb 25, 2019 Question Answering Visual Question Answering (VQA)
Code Available 1 Faithful Multimodal Explanation for Visual Question Answering Sep 8, 2018 Explanatory Visual Question Answering Question Answering
Code Available 1 R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering May 24, 2018 Question Answering Relation
Code Available 1 Compositional Attention Networks for Machine Reasoning Mar 8, 2018 Referring Expression Comprehension Visual Question Answering (VQA)
Code Available 1 AI2-THOR: An Interactive 3D Environment for Visual AI Dec 14, 2017 Deep Reinforcement Learning Imitation Learning
Code Available 1 Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments Nov 20, 2017 Reinforcement Learning Translation
Code Available 1 FiLM: Visual Reasoning with a General Conditioning Layer Sep 22, 2017 Image Retrieval with Multi-Modal Query Visual Question Answering (VQA)
Code Available 1 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Jul 25, 2017 Image Captioning Visual Question Answering
Code Available 1 ParlAI: A Dialog Research Software Platform May 18, 2017 reinforcement-learning Reinforcement Learning
Code Available 1 Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning Mar 20, 2017 Deep Reinforcement Learning reinforcement-learning
Code Available 1 CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Dec 20, 2016 Diagnostic Question Answering
Code Available 1 Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Oct 7, 2016 General Classification Image Attribution
Code Available 1 Hierarchical Question-Image Co-Attention for Visual Question Answering May 31, 2016 Visual Dialog Visual Question Answering
Code Available 1 Stacked Attention Networks for Image Question Answering Nov 7, 2015 Visual Question Answering (VQA)
Code Available 1 VQA: Visual Question Answering May 3, 2015 Image Captioning Multiple-choice
Code Available 1 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Jul 17, 2025 Language Modeling Language Modelling
Code Available 0 MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM Jul 16, 2025 Attribute Face Swapping
Unverified 0 Evaluating Attribute Confusion in Fashion Text-to-Image Generation Jul 9, 2025 Attribute cross-modal alignment
Unverified 0 LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation Jul 9, 2025 Question Answering Visual Question Answering
Unverified 0 DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images Jun 26, 2025 document understanding Optical Character Recognition (OCR)
Code Available 0 SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning Jun 26, 2025 In-Context Learning Medical Visual Question Answering
Unverified 0 Bridging Video Quality Scoring and Justification via Large Multimodal Models Jun 26, 2025 Video Quality Assessment Visual Question Answering (VQA)
Unverified 0 HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction Jun 25, 2025 Benchmarking Person Identification
Code Available 0 FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering Jun 25, 2025 Question Answering Visual Question Answering
Unverified 0 GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning Jun 22, 2025 Answer Generation Decision Making
Unverified 0