| LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | Mar 19, 2025 | Instruction Following, Multimodal Reasoning | Code Available | 2 |
| Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | May 27, 2025 | Multimodal Reasoning | Code Available | 2 |
| VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning | Apr 10, 2025 | Math, Multimodal Reasoning | Code Available | 2 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |
| DC3DO: Diffusion Classifier for 3D Objects | Aug 13, 2024 | 3D Object Classification, Classification | Code Available | 1 |
| Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination | Nov 15, 2024 | Hallucination, Multimodal Reasoning | Code Available | 1 |
| Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition | Mar 16, 2025 | Caption Generation, Image Captioning | Code Available | 1 |
| Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Feb 22, 2024 | Adversarial Robustness, Multimodal Reasoning | Code Available | 1 |
| Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start | May 28, 2025 | Math, Multimodal Reasoning | Code Available | 1 |
| Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings | Nov 29, 2024 | Multimodal Reasoning | Code Available | 1 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision | Aug 12, 2021 | 3D geometry, Descriptive | Code Available | 1 |
| CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models | Dec 17, 2024 | Multimodal Reasoning | Code Available | 1 |
| CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models | May 1, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding | Mar 29, 2022 | Multimodal Reasoning, Visual Grounding | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards | May 25, 2025 | Image Captioning, Multimodal Reasoning | Code Available | 1 |
| A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning | Mar 22, 2024 | Multimodal Reasoning | Code Available | 1 |
| Question-Aware Gaussian Experts for Audio-Visual Question Answering | Mar 6, 2025 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations | Apr 7, 2020 | Multimodal Reasoning, Natural Language Inference | Code Available | 1 |
| Fine-Grained Visual Entailment | Mar 29, 2022 | Multimodal Reasoning, Visual Entailment | Code Available | 1 |
| SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information | May 19, 2025 | Fairness, Multimodal Reasoning | Code Available | 1 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis | Mar 11, 2025 | All, Dataset Generation | Code Available | 1 |
| A Multimodal Framework for the Detection of Hateful Memes | Dec 23, 2020 | Ensemble Learning, Multimodal Reasoning | Code Available | 1 |
| PACS: A Dataset for Physical Audiovisual CommonSense Reasoning | Mar 21, 2022 | Common Sense Reasoning, Multimodal Reasoning | Code Available | 1 |
| Exploring the Transferability of Visual Prompting for Multimodal Large Language Models | Apr 17, 2024 | Hallucination, Multimodal Reasoning | Code Available | 1 |
| MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Jun 5, 2025 | Dataset Generation, Mathematical Problem-Solving | Code Available | 1 |
| Breaking the Data Barrier -- Building GUI Agents Through Task Generalization | Apr 14, 2025 | Mathematical Reasoning, Multimodal Reasoning | Code Available | 1 |
| Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training | Nov 23, 2023 | Multimodal Reasoning, Science Question Answering | Code Available | 1 |
| MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks | Oct 13, 2023 | multimodal interaction, Multimodal Reasoning | Code Available | 1 |
| Boosting MLLM Reasoning with Text-Debiased Hint-GRPO | Mar 31, 2025 | Mathematical Reasoning, Multimodal Reasoning | Code Available | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 |
| All in an Aggregated Image for In-Image Learning | Feb 28, 2024 | All, Hallucination | Code Available | 1 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking, Fact Checking | Code Available | 1 |
| MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Feb 19, 2025 | Multimodal Reasoning | Code Available | 1 |
| Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Aug 16, 2024 | Math, Mathematical Reasoning | Code Available | 1 |
| MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models | Apr 8, 2025 | Math, Multimodal Reasoning | Code Available | 1 |
| Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models | Dec 9, 2023 | Multimodal Reasoning | Code Available | 1 |
| DOMINO: A Dual-System for Multi-step Visual Language Reasoning | Oct 4, 2023 | Arithmetic Reasoning, Language Modeling | Code Available | 1 |
| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | May 30, 2025 | Autonomous Driving, Math | Code Available | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive Learning, Multimodal Reasoning | Code Available | 1 |
| Do Language Models Understand Time? | Dec 18, 2024 | Action Recognition, Anomaly Detection | Code Available | 1 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | May 18, 2025 | Logical Reasoning, Multimodal Reasoning | Code Available | 1 |
| LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation | May 19, 2023 | Image Generation, Instruction Following | Code Available | 1 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Dec 6, 2024 | Multimodal Reasoning, Visual Question Answering | Code Available | 1 |
| 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark | Mar 26, 2025 | Diagnostic, Multimodal Reasoning | Code Available | 1 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal Reasoning, Visual Commonsense Reasoning | Code Available | 1 |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models | Jun 8, 2025 | Multimodal Reasoning, Token Reduction | Code Available | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |