| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Apr 19, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Jul 17, 2020 | Decoder, Question Answering | Code Available | 1 | 5 |
| MISS: A Generative Pretraining and Finetuning Approach for Med-VQA | Jan 10, 2024 | Medical Visual Question Answering, Multi-Task Learning | Code Available | 1 | 5 |
| Detecting Hate Speech in Multi-modal Memes | Dec 29, 2020 | Binary Classification, Hate Speech Detection | Code Available | 1 | 5 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering | Code Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 | 5 |
| DeVLBert: Learning Deconfounded Visio-Linguistic Representations | Aug 16, 2020 | Image Retrieval, Question Answering | Code Available | 1 | 5 |
| Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention | Nov 23, 2020 | Classification, General Classification | Code Available | 1 | 5 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 | 5 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Dec 20, 2016 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | Jun 2, 2022 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 | 5 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 | 5 |
| Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules | May 11, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| Foundation Model is Efficient Multimodal Multitask Model Selector | Aug 11, 2023 | model, Model Selection | Code Available | 1 | 5 |
| R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering | May 24, 2018 | Question Answering, Relation | Code Available | 1 | 5 |
| Disentangling 3D Prototypical Networks For Few-Shot Concept Learning | Nov 6, 2020 | 3D geometry, 3D Object Detection | Code Available | 1 | 5 |
| SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards | May 25, 2025 | Image Captioning, Multimodal Reasoning | Code Available | 1 | 5 |
| Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering | Jul 6, 2021 | Active Learning, Object Recognition | Code Available | 1 | 5 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 | 5 |
| FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Dec 5, 2020 | image-classification, Image Classification | Code Available | 1 | 5 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 | 5 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 | 5 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 | 5 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 | 5 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 | 5 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 | 5 |
| DocVQA: A Dataset for VQA on Document Images | Jul 1, 2020 | Question Answering, Reading Comprehension | Code Available | 1 | 5 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute, Autonomous Driving | Code Available | 1 | 5 |
| Does Vision-and-Language Pretraining Improve Lexical Grounding? | Sep 21, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 | 5 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Jun 8, 2021 | Question Answering, Visual Entailment | Code Available | 1 | 5 |
| Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features | Jan 14, 2020 | Classification, Diversity | Code Available | 1 | 5 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | May 23, 2023 | Image Captioning, Meme Captioning | Code Available | 1 | 5 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models | Sep 23, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Dec 17, 2024 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Change Detection Meets Visual Question Answering | Dec 12, 2021 | Answer Generation, Change Detection | Code Available | 1 | 5 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | May 31, 2019 | object-detection, Object Detection | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| AI2-THOR: An Interactive 3D Environment for Visual AI | Dec 14, 2017 | Deep Reinforcement Learning, Imitation Learning | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |