| IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | Mar 20, 2025 | Scene UnderstandingSpatial Reasoning | CodeCode Available | 2 |
| Free-form language-based robotic reasoning and grasping | Mar 17, 2025 | FormRobotic Grasping | CodeCode Available | 2 |
| Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas | Mar 3, 2025 | Spatial Reasoning | CodeCode Available | 2 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Feb 24, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Feb 20, 2025 | Autonomous NavigationNavigate | CodeCode Available | 2 |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Jan 13, 2025 | Spatial Reasoning | CodeCode Available | 2 |
| Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Dec 16, 2024 | HallucinationRobot Manipulation | CodeCode Available | 2 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Dec 7, 2024 | Depth EstimationMathematical Reasoning | CodeCode Available | 2 |
| Probing the limitations of multimodal language models for chemistry and materials research | Nov 25, 2024 | Experimental DesignSpatial Reasoning | CodeCode Available | 2 |
| DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving | Nov 20, 2024 | Autonomous Drivingmotion prediction | CodeCode Available | 2 |