SOTAVerified

Spatial Reasoning

Papers

Showing 51-100 of 453 papers

Title | Status | Hype
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Code | 2
Seeing the roads through the trees: A benchmark for modeling spatial dependencies with aerial imagery | Code | 2
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Code | 2
SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning | Code | 1
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark | Code | 1
SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning | Code | 1
BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues | Code | 1
Learning Action and Reasoning-Centric Image Editing from Videos and Simulations | Code | 1
Joint Spatio-Textual Reasoning for Answering Tourism Questions | Code | 1
Learning and Reasoning with the Graph Structure Representation in Robotic Surgery | Code | 1
LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments | Code | 1
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings | Code | 1
iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs | Code | 1
Knot So Simple: A Minimalistic Environment for Spatial Reasoning | Code | 1
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models | Code | 1
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet | Code | 1
SmartPlay: A Benchmark for LLMs as Intelligent Agents | Code | 1
A Universal Semantic-Geometric Representation for Robotic Manipulation | Code | 1
SE-KGE: A Location-Aware Knowledge Graph Embedding Model for Geographic Question Answering and Spatial Semantic Lifting | Code | 1
Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Code | 1
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding | Code | 1
IndoNLI: A Natural Language Inference Dataset for Indonesian | Code | 1
SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding | Code | 1
Spatially Aware Multimodal Transformers for TextVQA | Code | 1
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting | Code | 1
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Code | 1
Revisiting spatio-temporal layouts for compositional action recognition | Code | 1
Grounded Chain-of-Thought for Multimodal Large Language Models | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
HSPFormer: Hierarchical Spatial Perception Transformer for Semantic Segmentation | Code | 1
SBEVNet: End-to-End Deep Stereo Layout Estimation | Code | 1
Geospatial Mechanistic Interpretability of Large Language Models | Code | 1
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Code | 1
CLIPort: What and Where Pathways for Robotic Manipulation | Code | 1
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Code | 1
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory | Code | 1
CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Code | 1
Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Code | 1
CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation | Code | 1
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space | Code | 1
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection | Code | 1
Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection | Code | 1
Self-supervised Spatial Reasoning on Multi-View Line Drawings | Code | 1
GuessWhat?! Visual object discovery through multi-modal dialogue | Code | 1
Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation | Code | 1
Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications | Code | 1
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Code | 1
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Code | 1
Decoding Language Spatial Relations to 2D Spatial Arrangements | Code | 1
End-to-End Egospheric Spatial Memory | Code | 1

No leaderboard results yet.