SOTAVerified

Spatial Reasoning

Papers

Showing 150 of 453 papers

Title | Status | Hype
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning | | 0
Warehouse Spatial Question Answering with LLM Agent | Code | 1
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | | 0
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning | | 0
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | | 0
Scaling RL to Long Videos | Code | 0
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | Code | 0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | | 0
ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Code | 0
Optimising Language Models for Downstream Tasks: A Post-Training Perspective | | 0
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | | 0
World-aware Planning Narratives Enhance Large Vision-Language Model Planner | | 0
From 2D to 3D Cognition: A Brief Survey of General World Models | | 0
Video Perception Models for 3D Scene Synthesis | | 0
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning | | 0
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks | | 0
ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies | | 0
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Code | 2
Leveraging LLMs for Mission Planning in Precision Agriculture | | 0
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Code | 1
A Multi-Modal Spatial Risk Framework for EV Charging Infrastructure Using Remote Sensing | | 0
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | | 0
Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning | | 0
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations | Code | 1
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | | 0
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing | | 0
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | | 0
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | | 0
ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment | | 0
In-the-wild Audio Spatialization with Flexible Text-guided Localization | Code | 0
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces | | 0
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames | | 0
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | Code | 1
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | Code | 0
Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Code | 1
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Code | 1
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | | 0
Grounded Reinforcement Learning for Visual Reasoning | Code | 0
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | Code | 2
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence | | 0
ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Code | 1
VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models | | 0
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models | | 0
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | Code | 3
MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models | | 0
Agentic 3D Scene Generation with Spatially Contextualized VLMs | | 0
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | | 0
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | | 0
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | | 0

No leaderboard results yet.