| MindJourney: Test-Time Scaling with World Models for Spatial Reasoning | Jul 16, 2025 | Spatial Reasoning | Unverified | 0 |
| Warehouse Spatial Question Answering with LLM Agent | Jul 14, 2025 | Question Answering, Spatial Reasoning | Code Available | 1 |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Jul 14, 2025 | Scene Understanding, Spatial Reasoning | Unverified | 0 |
| M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning | Jul 11, 2025 | Spatial Reasoning | Unverified | 0 |
| ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | Jul 11, 2025 | Depth Estimation, Hallucination | Unverified | 0 |
| Scaling RL to Long Videos | Jul 10, 2025 | Reinforcement Learning (RL), Spatial Reasoning | Code Available | 0 |
| OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | Jul 10, 2025 | Scene Understanding, Spatial Reasoning | Code Available | 0 |
| A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | Jul 9, 2025 | 3D visual grounding, Autonomous Navigation | Unverified | 0 |
| ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Jun 26, 2025 | Spatial Reasoning | Code Available | 0 |
| Optimising Language Models for Downstream Tasks: A Post-Training Perspective | Jun 26, 2025 | parameter-efficient fine-tuning, Spatial Reasoning | Unverified | 0 |
| ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | Jun 26, 2025 | Spatial Reasoning, Video Generation | Unverified | 0 |
| World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Jun 26, 2025 | Imitation Learning, Language Modeling | Unverified | 0 |
| From 2D to 3D Cognition: A Brief Survey of General World Models | Jun 25, 2025 | Autonomous Driving, Scene Generation | Unverified | 0 |
| Video Perception Models for 3D Scene Synthesis | Jun 25, 2025 | 3D Reconstruction, Image Generation | Unverified | 0 |
| PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning | Jun 17, 2025 | General Reinforcement Learning, Multimodal Reasoning | Unverified | 0 |
| SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks | Jun 17, 2025 | Math, Spatial Reasoning | Unverified | 0 |
| ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies | Jun 17, 2025 | Scene Generation, Spatial Reasoning | Unverified | 0 |
| Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Jun 11, 2025 | Multimodal Reasoning, Spatial Reasoning | Code Available | 2 |
| Leveraging LLMs for Mission Planning in Precision Agriculture | Jun 11, 2025 | Spatial Reasoning | Unverified | 0 |
| 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Jun 11, 2025 | Spatial Reasoning | Code Available | 1 |
| A Multi-Modal Spatial Risk Framework for EV Charging Infrastructure Using Remote Sensing | Jun 10, 2025 | Spatial Reasoning | Unverified | 0 |
| PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Jun 10, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning | Jun 5, 2025 | In-Context Learning, Indoor Scene Synthesis | Unverified | 0 |
| Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations | Jun 5, 2025 | 4k, Spatial Reasoning | Code Available | 1 |
| From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Jun 5, 2025 | 3D visual grounding, Object | Unverified | 0 |
| SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing | Jun 4, 2025 | Spatial Reasoning | Unverified | 0 |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | Jun 4, 2025 | Spatial Reasoning | Unverified | 0 |
| OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Jun 3, 2025 | Object Counting, Spatial Reasoning | Unverified | 0 |
| ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment | Jun 3, 2025 | Indoor Scene Synthesis, Object | Unverified | 0 |
| In-the-wild Audio Spatialization with Flexible Text-guided Localization | Jun 1, 2025 | Spatial Reasoning | Code Available | 0 |
| Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces | May 30, 2025 | Spatial Reasoning | Unverified | 0 |
| Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames | May 30, 2025 | Object, Spatial Reasoning | Unverified | 0 |
| VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | May 30, 2025 | Question Answering, Spatial Reasoning | Code Available | 1 |
| Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | May 30, 2025 | 3D geometry, Large Language Model | Code Available | 0 |
| Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | May 30, 2025 | Spatial Reasoning, Visual Reasoning | Code Available | 1 |
| Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | May 29, 2025 | Handwritten Mathematical Expression Recognition, Language Modeling | Code Available | 1 |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | May 29, 2025 | Multiple-choice, Spatial Reasoning | Unverified | 0 |
| Grounded Reinforcement Learning for Visual Reasoning | May 29, 2025 | reinforcement-learning, Reinforcement Learning | Code Available | 0 |
| ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | May 29, 2025 | Spatial Reasoning | Code Available | 2 |
| Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence | May 29, 2025 | Spatial Reasoning | Unverified | 0 |
| ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | May 28, 2025 | Imitation Learning, Math | Code Available | 1 |
| VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models | May 27, 2025 | Spatial Reasoning, Visual Tracking | Unverified | 0 |
| Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models | May 27, 2025 | Diagnostic, Spatial Reasoning | Unverified | 0 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 |
| VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | May 26, 2025 | 3D Reconstruction, Spatial Reasoning | Code Available | 3 |
| MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models | May 26, 2025 | Spatial Reasoning | Unverified | 0 |
| Agentic 3D Scene Generation with Spatially Contextualized VLMs | May 26, 2025 | Multimodal Reasoning, Scene Generation | Unverified | 0 |
| ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | May 26, 2025 | cross-modal alignment, Position | Unverified | 0 |
| Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | May 24, 2025 | Scene Understanding, Spatial Reasoning | Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |