| LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | May 23, 2023 | Common Sense Reasoning, Image Generation | Code Available | 2 |
| IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | Mar 20, 2025 | Scene Understanding, Spatial Reasoning | Code Available | 2 |
| ConceptFusion: Open-set Multimodal 3D Mapping | Feb 14, 2023 | 3D geometry, Autonomous Driving | Code Available | 2 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Feb 20, 2025 | Autonomous Navigation, Navigate | Code Available | 2 |
| InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | Apr 19, 2025 | Action Generation, Logical Reasoning | Code Available | 2 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Feb 24, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Dec 16, 2024 | Hallucination, Robot Manipulation | Code Available | 2 |
| DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving | Nov 20, 2024 | Autonomous Driving, motion prediction | Code Available | 2 |
| Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning | Apr 17, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 2 |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models | Jun 21, 2024 | Spatial Reasoning | Code Available | 2 |