| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | Diagnostic, Language Modeling | Code Available | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 |
| Debiased Visual Question Answering from Feature and Sample Perspectives | Dec 1, 2021 | Bias Detection, Question Answering | Code Available | 1 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Oct 25, 2021 | Arithmetic Reasoning, Mathematical Question Answering | Code Available | 1 |
| Declaration-based Prompt Tuning for Visual Question Answering | May 5, 2022 | Image-text matching, Language Modeling | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making, Medical Visual Question Answering | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training | May 24, 2021 | Image Captioning, Medical Visual Question Answering | Code Available | 1 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering | Sep 18, 2020 | Out-of-Distribution Generalization, Question Answering | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | May 31, 2023 | Cross-Modal Retrieval, Question Answering | Code Available | 1 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Jun 8, 2021 | Question Answering, Visual Entailment | Code Available | 1 |
| DeVLBert: Learning Deconfounded Visio-Linguistic Representations | Aug 16, 2020 | Image Retrieval, Question Answering | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |