SOTAVerified

Spatial Reasoning

Papers

Showing 101–150 of 453 papers

Title | Status | Hype
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space | Code | 1
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Code | 1
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models | Code | 1
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations | Code | 1
Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation | Code | 1
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization | Code | 1
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability | Code | 1
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Code | 1
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning | Code | 1
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities | Code | 1
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Code | 1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
Long Range Arena: A Benchmark for Efficient Transformers | Code | 1
Visuospatial Cognitive Assistant | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
Warehouse Spatial Question Answering with LLM Agent | Code | 1
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection | Code | 1
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Code | 1
Learning and Reasoning with the Graph Structure Representation in Robotic Surgery | Code | 1
Knot So Simple: A Minimalistic Environment for Spatial Reasoning | Code | 1
LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments | Code | 1
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet | Code | 1
iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs | Code | 1
Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Code | 1
Joint Spatio-Textual Reasoning for Answering Tourism Questions | Code | 1
IndoNLI: A Natural Language Inference Dataset for Indonesian | Code | 1
End-to-End Egospheric Spatial Memory | Code | 1
Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning | Code | 1
GuessWhat?! Visual object discovery through multi-modal dialogue | Code | 1
Grounded Chain-of-Thought for Multimodal Large Language Models | Code | 1
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Code | 1
Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection | Code | 1
Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications | Code | 1
HSPFormer: Hierarchical Spatial Perception Transformer for Semantic Segmentation | Code | 1
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | - | 0
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | - | 0
Embodied World Models Emerge from Navigational Task in Open-Ended Environments | - | 0
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks | - | 0
Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization | - | 0
A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning | - | 0
Embodied Scene Understanding for Vision Language Models via MetaVQA | - | 0
Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios | - | 0
Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation | - | 0
An Evaluation of ChatGPT-4's Qualitative Spatial Reasoning Capabilities in RCC-8 | - | 0
3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | - | 0
Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark | - | 0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | - | 0
Ego-Centric Spatial Memory Networks | - | 0
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery | - | 0
Advancing Egocentric Video Question Answering with Multimodal Large Language Models | - | 0
Page 3 of 10

No leaderboard results yet.