| Emerging Properties in Unified Multimodal Pretraining | May 20, 2025 | Image Editing | Code Available | 9 |
| Skywork-R1V3 Technical Report | Jul 8, 2025 | cross-modal alignment, Mathematical Reasoning | Code Available | 7 |
| Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | Apr 23, 2025 | Multimodal Reasoning, reinforcement-learning | Code Available | 7 |
| Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought | Apr 8, 2025 | Language Modeling | Code Available | 7 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jul 1, 2025 | document understanding, Multimodal Reasoning | Code Available | 7 |
| LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Nov 15, 2024 | Logical Reasoning, Multimodal Reasoning | Code Available | 7 |
| Kimi-VL Technical Report | Apr 10, 2025 | Long-Context Understanding, Mathematical Reasoning | Code Available | 5 |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | Mar 9, 2025 | Math, Multimodal Reasoning | Code Available | 5 |
| DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | May 20, 2025 | Hallucination, Mathematical Reasoning | Code Available | 5 |
| Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers | Jun 30, 2025 | Multimodal Reasoning | Code Available | 5 |
| R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Mar 7, 2025 | Multimodal Reasoning, reinforcement-learning | Code Available | 4 |
| R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization | Mar 13, 2025 | Multimodal Reasoning | Code Available | 4 |
| MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision | May 19, 2025 | Math, Mathematical Reasoning | Code Available | 4 |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | May 8, 2025 | Multimodal Reasoning | Code Available | 4 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language Modeling | Code Available | 4 |
| MiMo-VL Technical Report | Jun 4, 2025 | Multimodal Reasoning | Code Available | 4 |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | Mar 10, 2025 | Logical Reasoning, Multimodal Reasoning | Code Available | 4 |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | Mar 10, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 4 |
| Affordable AI Assistants with Knowledge Graph of Thoughts | Apr 3, 2025 | Knowledge Graphs, LLM real-life tasks | Code Available | 3 |
| Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation | Feb 12, 2025 | cross-modal alignment, multimodal generation | Code Available | 3 |
| Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models | May 26, 2023 | GSM8K, Multimodal Reasoning | Code Available | 3 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | Dec 5, 2024 | Multimodal Reasoning, Natural Language Visual Grounding | Code Available | 3 |
| DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Jul 6, 2025 | Image Generation, Multimodal Reasoning | Code Available | 3 |
| Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens | Jun 20, 2025 | Image Generation, Multimodal Reasoning | Code Available | 3 |
| ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | May 24, 2025 | Benchmarking, Chart Understanding | Code Available | 3 |
| Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models | Mar 4, 2025 | Language Modeling | Code Available | 3 |
| MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Aug 2, 2024 | Multimodal Reasoning, Multiple-choice | Code Available | 3 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Mar 6, 2024 | Multimodal Reasoning, Question Answering | Code Available | 2 |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | Jun 26, 2025 | Large Language Model, Multimodal Reasoning | Code Available | 2 |
| HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation | Apr 13, 2025 | Multimodal Reasoning, RAG | Code Available | 2 |
| Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning | May 20, 2025 | Domain Generalization, Multimodal Reasoning | Code Available | 2 |
| FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Jun 11, 2025 | Multimodal Reasoning, Spatial Reasoning | Code Available | 2 |
| The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles | Feb 3, 2025 | ARC, Multimodal Reasoning | Code Available | 2 |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | Mar 21, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 2 |
| Efficient Reasoning with Hidden Thinking | Jan 31, 2025 | Decoder, Multimodal Reasoning | Code Available | 2 |
| Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning | Apr 17, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 2 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| Multimodal Analogical Reasoning over Knowledge Graphs | Oct 1, 2022 | Graph Embedding, Knowledge Graph Embedding | Code Available | 2 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |
| Play to Generalize: Learning to Reason Through Game Play | Jun 9, 2025 | Domain Generalization, Math | Code Available | 2 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | Mar 20, 2023 | Multimodal Reasoning, Visual Question Answering | Code Available | 2 |
| Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Feb 22, 2024 | Diversity, Math | Code Available | 2 |
| Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models? | Mar 8, 2025 | Mathematical Reasoning, Multimodal Reasoning | Code Available | 2 |
| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | Benchmarking, Blocking | Code Available | 2 |
| DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding | Mar 17, 2025 | Domain Generalization, Multimodal Reasoning | Code Available | 2 |
| Distill Visual Chart Reasoning Ability from LLMs to MLLMs | Oct 24, 2024 | Multimodal Reasoning, Visual Reasoning | Code Available | 2 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Feb 8, 2024 | Computational Efficiency, Multimodal Reasoning | Code Available | 2 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sep 20, 2022 | Multimodal Deep Learning, Multimodal Reasoning | Code Available | 2 |
| LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | Mar 19, 2025 | Instruction Following, Multimodal Reasoning | Code Available | 2 |