SOTAVerified

Spatial Reasoning

Papers

Showing 51–75 of 453 papers

Title | Status | Hype
Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | Code | 2
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | Code | 2
ConceptFusion: Open-set Multimodal 3D Mapping | Code | 2
Warehouse Spatial Question Answering with LLM Agent | Code | 1
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Code | 1
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations | Code | 1
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | Code | 1
Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Code | 1
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Code | 1
ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
Knot So Simple: A Minimalistic Environment for Spatial Reasoning | Code | 1
CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation | Code | 1
Visuospatial Cognitive Assistant | Code | 1
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts | Code | 1
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Code | 1
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory | Code | 1
Geospatial Mechanistic Interpretability of Large Language Models | Code | 1
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization | Code | 1
SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding | Code | 1
Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Code | 1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Code | 1
Grounded Chain-of-Thought for Multimodal Large Language Models | Code | 1
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Code | 1
Page 3 of 19

Leaderboard

No leaderboard results yet.