| Embodied Scene Understanding for Vision Language Models via MetaVQA | Jan 15, 2025 | Decision Making, Question Answering | —Unverified | 0 |
| Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios | Mar 10, 2025 | Image Restoration, Image Super-Resolution | —Unverified | 0 |
| Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation | Apr 13, 2025 | Navigate, Object Rearrangement | —Unverified | 0 |
| An Evaluation of ChatGPT-4's Qualitative Spatial Reasoning Capabilities in RCC-8 | Sep 27, 2023 | Spatial Reasoning | —Unverified | 0 |
| 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | Jan 28, 2025 | Instruction Following, Mixture-of-Experts | —Unverified | 0 |
| Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark | Jan 1, 2023 | 3D Pose Estimation, Human Detection | —Unverified | 0 |
| A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | Jul 9, 2025 | 3D visual grounding, Autonomous Navigation | —Unverified | 0 |
| Ego-Centric Spatial Memory Networks | Jan 1, 2021 | CPU, GPU | —Unverified | 0 |
| EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery | Apr 17, 2025 | Large Language Model, Multi-Task Learning | —Unverified | 0 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Apr 6, 2025 | Object Recognition, Question Answering | —Unverified | 0 |