SOTAVerified

Spatial Reasoning

Papers

Showing 1–50 of 453 papers

| Title | Status | Hype |
| --- | --- | --- |
| When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models | Code | 7 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7 |
| Visual Instruction Tuning | Code | 6 |
| GPT-4 Technical Report | Code | 6 |
| Improved Baselines with Visual Instruction Tuning | Code | 6 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5 |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | Code | 4 |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Code | 4 |
| SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models | Code | 4 |
| PointVLA: Injecting the 3D World into Vision-Language-Action Models | Code | 4 |
| Sonata: Self-Supervised Learning of Reliable Point Representations | Code | 4 |
| Factorio Learning Environment | Code | 4 |
| SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | Code | 3 |
| SpatialBot: Precise Spatial Understanding with Vision Language Models | Code | 3 |
| Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Code | 3 |
| VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | Code | 3 |
| CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos | Code | 3 |
| MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse | Code | 3 |
| Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation | Code | 2 |
| ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | Code | 2 |
| Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities | Code | 2 |
| SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding | Code | 2 |
| SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models | Code | 2 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Code | 2 |
| GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning | Code | 2 |
| Getting it Right: Improving Spatial Consistency in Text-to-Image Models | Code | 2 |
| SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Code | 2 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Code | 2 |
| Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Code | 2 |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Code | 2 |
| Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Code | 2 |
| Seeing the roads through the trees: A benchmark for modeling spatial dependencies with aerial imagery | Code | 2 |
| DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving | Code | 2 |
| Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | Code | 2 |
| Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples | Code | 2 |
| Probing the limitations of multimodal language models for chemistry and materials research | Code | 2 |
| Locality Alignment Improves Vision-Language Models | Code | 2 |
| ConceptFusion: Open-set Multimodal 3D Mapping | Code | 2 |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models | Code | 2 |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Code | 2 |
| LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | Code | 2 |
| Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning | Code | 2 |
| InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | Code | 2 |
| Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes | Code | 2 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Code | 2 |
| End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering | Code | 2 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2 |
| From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D | Code | 2 |
| Free-form language-based robotic reasoning and grasping | Code | 2 |
| Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead | Code | 2 |
Page 1 of 10

No leaderboard results yet.