| Med-Flamingo: a Multimodal Medical Few-shot Learner | Jul 27, 2023 | Medical Visual Question Answering; Question Answering | Code Available | 2 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | Attribute; Common Sense Reasoning | Code Available | 2 |
| JourneyDB: A Benchmark for Generative Image Understanding | Jul 3, 2023 | Image Captioning; Image Comprehension | Code Available | 2 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Jun 27, 2023 | Image Captioning; Referring Expression Segmentation | Code Available | 2 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Jun 26, 2023 | Hallucination; Visual Question Answering | Code Available | 2 |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Jun 15, 2023 | Hallucination; Image Captioning | Code Available | 2 |
| BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | May 26, 2023 | Image Captioning; Medical Visual Question Answering | Code Available | 2 |
| NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | May 24, 2023 | Autonomous Driving; Question Answering | Code Available | 2 |
| OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | May 13, 2023 | Key Information Extraction; Nutrition | Code Available | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | Mar 20, 2023 | Multimodal Reasoning; Visual Question Answering | Code Available | 2 |
| PaLM-E: An Embodied Multimodal Language Model | Mar 6, 2023 | Language Modeling; Language Modelling | Code Available | 2 |
| Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | Mar 3, 2023 | Language Modelling; Large Language Model | Code Available | 2 |
| Visual Programming: Compositional visual reasoning without training | Nov 18, 2022 | In-Context Learning; Question Answering | Code Available | 2 |
| PoseScript: Linking 3D Human Poses and Natural Language | Oct 21, 2022 | Cross-Modal Retrieval; Image Captioning | Code Available | 2 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | Oct 7, 2022 | Answer Generation; Diagnostic | Code Available | 2 |
| Vision-Language Pre-Training with Triple Contrastive Learning | Feb 21, 2022 | Contrastive Learning; cross-modal alignment | Code Available | 2 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase Grounding; Question Answering | Code Available | 2 |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | Sep 24, 2019 | Decoder; Image Captioning | Code Available | 2 |
| Describe Anything Model for Visual Question Answering on Text-rich Images | Jul 16, 2025 | Descriptive; Language Modeling | Code Available | 1 |
| SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Jun 16, 2025 | document understanding; Question Answering | Code Available | 1 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering; Visual Question Answering | Code Available | 1 |
| VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | May 30, 2025 | Question Answering; Spatial Reasoning | Code Available | 1 |
| Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | May 29, 2025 | Diagnostic; Question Answering | Code Available | 1 |
| MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | May 26, 2025 | Question Answering; Visual Question Answering | Code Available | 1 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking; Minecraft | Code Available | 1 |
| Visualized Text-to-Image Retrieval | May 26, 2025 | Image Retrieval; Question Answering | Code Available | 1 |
| SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards | May 25, 2025 | Image Captioning; Multimodal Reasoning | Code Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy; Benchmarking | Code Available | 1 |
| VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models | May 23, 2025 | Question Answering; Visual Question Answering | Code Available | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking; Evidence Selection | Code Available | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination; Image Description | Code Available | 1 |
| Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? | May 19, 2025 | Logical Reasoning; Optical Character Recognition | Code Available | 1 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | Benchmarking; Medical Visual Question Answering | Code Available | 1 |
| UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation | Apr 30, 2025 | Diagnostic; Large Language Model | Code Available | 1 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic; Question Answering | Code Available | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking; Math | Code Available | 1 |
| ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models | Apr 14, 2025 | Autonomous Driving; Autonomous Vehicles | Code Available | 1 |
| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning; Question Answering | Code Available | 1 |
| STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Apr 3, 2025 | Instruction Following; Language Modeling | Code Available | 1 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning | Apr 2, 2025 | Decision Making; Diagnostic | Code Available | 1 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute; Benchmarking | Code Available | 1 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute; Autonomous Driving | Code Available | 1 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question Answering; Multi-Task Learning | Code Available | 1 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking; Image Captioning | Code Available | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles; Benchmarking | Code Available | 1 |
| NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Mar 17, 2025 | Question Answering; Scene Understanding | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning; Question Answering | Code Available | 1 |
| Question-Aware Gaussian Experts for Audio-Visual Question Answering | Mar 6, 2025 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts; Question Answering | Code Available | 1 |