| Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | May 26, 2025 | Uncertainty Quantification, Visual Reasoning | Unverified | 0 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use | May 25, 2025 | Multimodal Reasoning, Question Answering | Code Available | 2 |
| The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework | May 25, 2025 | Attribute, Language Modeling | Unverified | 0 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding | May 25, 2025 | Chart Understanding, Logical Reasoning | Code Available | 0 |
| Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | May 24, 2025 | document understanding, Visual Reasoning | Unverified | 0 |
| Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | May 24, 2025 | Scene Understanding, Spatial Reasoning | Unverified | 0 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | May 24, 2025 | geo-localization, Visual Reasoning | Code Available | 1 |
| FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving | May 23, 2025 | Autonomous Driving, Image Generation | Unverified | 0 |
| One RL to See Them All: Visual Triple Unified Reinforcement Learning | May 23, 2025 | All, Math | Unverified | 0 |
| DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding | May 23, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | May 23, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning | May 22, 2025 | Optical Character Recognition (OCR), Visual Reasoning | Code Available | 0 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | May 22, 2025 | Visual Reasoning | Code Available | 1 |
| RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs | May 22, 2025 | Image Manipulation, Math | Unverified | 0 |
| OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning | May 22, 2025 | Open Vocabulary Panoptic Segmentation, Open Vocabulary Semantic Segmentation | Code Available | 1 |
| LaViDa: A Large Diffusion Language Model for Multimodal Understanding | May 22, 2025 | Instruction Following, Language Modeling | Code Available | 3 |
| GRIT: Teaching MLLMs to Think with Images | May 21, 2025 | Reinforcement Learning (RL), Visual Reasoning | Unverified | 0 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | May 21, 2025 | Reinforcement Learning (RL), Visual Reasoning | Unverified | 0 |
| STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs | May 21, 2025 | Efficient Exploration, Reinforcement Learning (RL) | Code Available | 0 |
| Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | May 21, 2025 | 4k, Multimodal Reasoning | Unverified | 0 |
| VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank | May 20, 2025 | Image Generation, Image Quality Assessment | Code Available | 2 |
| DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | May 20, 2025 | Hallucination, Mathematical Reasoning | Code Available | 5 |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | May 20, 2025 | reinforcement-learning, Reinforcement Learning | Unverified | 0 |
| ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | May 19, 2025 | Chart Question Answering, Chart Understanding | Unverified | 0 |
| Neurosymbolic Diffusion Models | May 19, 2025 | Autonomous Driving, Uncertainty Quantification | Code Available | 2 |
| Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks | May 19, 2025 | Visual Reasoning | Unverified | 0 |
| ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models | May 19, 2025 | Visual Reasoning | Code Available | 0 |
| RVTBench: A Benchmark for Visual Reasoning Tasks | May 17, 2025 | Reasoning Segmentation, Visual Question Answering (VQA) | Code Available | 0 |
| Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans | May 16, 2025 | Multimodal Reasoning, Visual Reasoning | Unverified | 0 |
| OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | May 13, 2025 | Reinforcement Learning (RL), Visual Reasoning | Code Available | 3 |
| Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | May 9, 2025 | 4k, Domain Generalization | Code Available | 0 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | May 7, 2025 | Multiple-choice, Question Answering | Code Available | 2 |
| VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making | May 6, 2025 | Decision Making, General Knowledge | Unverified | 0 |
| A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law | May 5, 2025 | Math, Medical Diagnosis | Unverified | 0 |
| Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs | Apr 30, 2025 | Hallucination, Hallucination Evaluation | Unverified | 0 |
| NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks | Apr 28, 2025 | Task Planning, Vision-Language-Action | Unverified | 0 |
| Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models | Apr 27, 2025 | Visual Reasoning, World Knowledge | Unverified | 0 |
| A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task | Apr 24, 2025 | Question Answering, Retrieval | Unverified | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | Math, MMLU | Unverified | 0 |
| VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | Apr 21, 2025 | Attribute, Visual Reasoning | Unverified | 0 |
| Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | Apr 18, 2025 | Math, Visual Reasoning | Unverified | 0 |
| NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Apr 17, 2025 | Data Augmentation, Diversity | Code Available | 2 |
| LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| Visual Language Models show widespread visual deficits on neuropsychological tests | Apr 15, 2025 | Object Recognition, Visual Reasoning | Unverified | 0 |
| CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | Apr 14, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | Apr 14, 2025 | Logical Reasoning, Multimodal Reasoning | Unverified | 0 |
| SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models | Apr 10, 2025 | Reinforcement Learning (RL), Visual Reasoning | Code Available | 2 |