| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 | 5 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 | 5 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 | 5 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 | 5 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 | 5 |
| DocVQA: A Dataset for VQA on Document Images | Jul 1, 2020 | Question Answering, Reading Comprehension | Code Available | 1 | 5 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute, Autonomous Driving | Code Available | 1 | 5 |
| Does Vision-and-Language Pretraining Improve Lexical Grounding? | Sep 21, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 | 5 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Jun 8, 2021 | Question Answering, Visual Entailment | Code Available | 1 | 5 |
| Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features | Jan 14, 2020 | Classification, Diversity | Code Available | 1 | 5 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | May 23, 2023 | Image Captioning, Meme Captioning | Code Available | 1 | 5 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models | Sep 23, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Dec 17, 2024 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Change Detection Meets Visual Question Answering | Dec 12, 2021 | Answer Generation, Change Detection | Code Available | 1 | 5 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | May 31, 2019 | object-detection, Object Detection | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| AI2-THOR: An Interactive 3D Environment for Visual AI | Dec 14, 2017 | Deep Reinforcement Learning, Imitation Learning | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |