| Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images | Jan 1, 2021 | Attribute; Multiple Instance Learning | Code Available | 1 | 5 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts; Question Answering | Code Available | 1 | 5 |
| Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | Feb 18, 2021 | Decoder; Document Image Classification | Code Available | 1 | 5 |
| Good Questions Help Zero-Shot Image Reasoning | Dec 4, 2023 | Fine-Grained Image Classification; Question Answering | Code Available | 1 | 5 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking; Fact Checking | Code Available | 1 | 5 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction; Language Modeling | Code Available | 1 | 5 |
| Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering; Visual Question Answering (VQA) | Code Available | 1 | 5 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning; Image-text Retrieval | Code Available | 1 | 5 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | May 23, 2023 | Image Captioning; Meme Captioning | Code Available | 1 | 5 |
| Change Detection Meets Visual Question Answering | Dec 12, 2021 | Answer Generation; Change Detection | Code Available | 1 | 5 |
| OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | May 31, 2019 | object-detection; Object Detection | Code Available | 1 | 5 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment; Image-text Retrieval | Code Available | 1 | 5 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Dec 11, 2023 | Image Captioning; Question Answering | Code Available | 1 | 5 |
| AI2-THOR: An Interactive 3D Environment for Visual AI | Dec 14, 2017 | Deep Reinforcement Learning; Imitation Learning | Code Available | 1 | 5 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering; Referring Expression | Code Available | 1 | 5 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning | Apr 2, 2025 | Decision Making; Diagnostic | Code Available | 1 | 5 |
| GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering | Apr 20, 2021 | Graph Neural Network; Graph Question Answering | Code Available | 1 | 5 |
| Multimodal fusion of imaging and genomics for lung cancer recurrence prediction | Feb 5, 2020 | Computed Tomography (CT); Question Answering | Code Available | 1 | 5 |
| NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations | Dec 11, 2023 | Autonomous Driving; Descriptive | Code Available | 1 | 5 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning; Image to text | Code Available | 1 | 5 |
| Generative Bias for Robust Visual Question Answering | Aug 1, 2022 | Knowledge Distillation; Question Answering | Code Available | 1 | 5 |
| MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations; Question Answering | Code Available | 1 | 5 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 | 5 |
| Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations | Feb 10, 2024 | Diagnostic; Hallucination | Code Available | 1 | 5 |
| Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering | Mar 21, 2024 | object-detection; Object Detection | Code Available | 1 | 5 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery; Causal Discovery in Video Reasoning | Code Available | 1 | 5 |
| Gated Hierarchical Attention for Image Captioning | Oct 30, 2018 | Decoder; Image Captioning | Code Available | 1 | 5 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning; DeepFake Detection | Code Available | 1 | 5 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking; Question Answering | Code Available | 1 | 5 |
| Can We Talk Models Into Seeing the World Differently? | Mar 14, 2024 | Image Captioning; Image Classification | Code Available | 1 | 5 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency; cross-modal alignment | Code Available | 1 | 5 |
| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Feb 17, 2025 | Diagnostic; Question Answering | Code Available | 1 | 5 |
| Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules | May 11, 2021 | Question Answering; Visual Question Answering | Code Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy; Benchmarking | Code Available | 1 | 5 |
| 3D-Aware Visual Question Answering about Parts, Poses and Occlusions | Oct 27, 2023 | Question Answering; Visual Question Answering | Code Available | 1 | 5 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification; Action Recognition | Code Available | 1 | 5 |
| Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code Generation; In-Context Learning | Code Available | 1 | 5 |
| Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features | Jan 14, 2020 | Classification; Diversity | Code Available | 1 | 5 |
| EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models | Nov 27, 2023 | Attribute; Question Answering | Code Available | 1 | 5 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute; Autonomous Driving | Code Available | 1 | 5 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering; Question Answering | Code Available | 1 | 5 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making; Medical Visual Question Answering | Code Available | 1 | 5 |
| MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering | Oct 27, 2020 | Diagnostic; Question Answering | Code Available | 1 | 5 |
| Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Feb 23, 2023 | Open-Domain Question Answering; Question Answering | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute; Benchmarking | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning; Graph Generation | Code Available | 1 | 5 |
| Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving; Autonomous Vehicles | Code Available | 1 | 5 |
| Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Jul 22, 2023 | Graph Representation Learning; Language Modeling | Code Available | 1 | 5 |
| Foundation Model is Efficient Multimodal Multitask Model Selector | Aug 11, 2023 | model; Model Selection | Code Available | 1 | 5 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection; Question Answering | Code Available | 1 | 5 |