| Multimodal fusion of imaging and genomics for lung cancer recurrence prediction | Feb 5, 2020 | Computed Tomography (CT), Question Answering | Code Available | 1 |
| Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features | Jan 14, 2020 | Classification, Diversity | Code Available | 1 |
| In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering | Code Available | 1 |
| Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Dec 5, 2019 | Language Modelling, Representation Learning | Code Available | 1 |
| Overcoming Data Limitation in Medical Visual Question Answering | Sep 26, 2019 | Denoising, Medical Visual Question Answering | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases | Sep 9, 2019 | Natural Language Inference, Question Answering | Code Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Aug 20, 2019 | Language Modeling, Language Modelling | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| Scene Text Visual Question Answering | May 31, 2019 | Question Answering, Visual Question Answering | Code Available | 1 |
| OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | May 31, 2019 | object-detection, Object Detection | Code Available | 1 |
| Gated Hierarchical Attention for Image Captioning | Oct 30, 2018 | Decoder, Image Captioning | Code Available | 1 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 |
| R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering | May 24, 2018 | Question Answering, Relation | Code Available | 1 |
| AI2-THOR: An Interactive 3D Environment for Visual AI | Dec 14, 2017 | Deep Reinforcement Learning, Imitation Learning | Code Available | 1 |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | Nov 20, 2017 | Reinforcement Learning, Translation | Code Available | 1 |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | Jul 25, 2017 | Image Captioning, Visual Question Answering | Code Available | 1 |
| Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | Mar 20, 2017 | Deep Reinforcement Learning, reinforcement-learning | Code Available | 1 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Dec 20, 2016 | Diagnostic, Question Answering | Code Available | 1 |
| Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization | Oct 7, 2016 | General Classification, Image Attribution | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| VQA: Visual Question Answering | May 3, 2015 | Image Captioning, Multiple-choice | Code Available | 1 |
| Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights | Jul 9, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0 |
| MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Jul 9, 2025 | Diagnostic, Multimodal Reasoning | Unverified | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | Unverified | 0 |
| LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | Jul 9, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | Jul 8, 2025 | Articles, Multimodal Reasoning | Unverified | 0 |
| ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding | Jul 7, 2025 | Hallucination, Question Answering | Unverified | 0 |
| Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models | Jun 28, 2025 | image-classification, Image Classification | Code Available | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 0 |
| SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning, Medical Visual Question Answering | Unverified | 0 |
| FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Jun 25, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Jun 25, 2025 | Benchmarking, Person Identification | Code Available | 0 |
| Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search | Jun 25, 2025 | Question Answering, Retrieval | Unverified | 0 |
| GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | Jun 22, 2025 | Answer Generation, Decision Making | Unverified | 0 |
| Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Jun 21, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights | Jun 19, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Jun 18, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Adapting Lightweight Vision Language Models for Radiological Visual Question Answering | Jun 17, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making | Jun 15, 2025 | Answer Generation, Decision Making | Unverified | 0 |
| AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Jun 14, 2025 | Decision Making, Question Answering | Unverified | 0 |
| MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space | Jun 13, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions | Jun 13, 2025 | Conformal Prediction, Question Answering | Unverified | 0 |
| HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Jun 12, 2025 | Hallucination, Image Captioning | Code Available | 0 |
| SlotPi: Physics-informed Object-centric Reasoning Models | Jun 12, 2025 | Object, Question Answering | Code Available | 0 |
| Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning | Jun 11, 2025 | In-Context Learning, Question Answering | Unverified | 0 |
| Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos | Jun 11, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Jun 11, 2025 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | Unverified | 0 |