| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | Jul 21, 2025 | Multimodal Reasoning | —Unverified | 0 |
| Visual Place Recognition for Large-Scale UAV Applications | Jul 20, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| Transformer-based Spatial Grounding: A Comprehensive Survey | Jul 17, 2025 | cross-modal alignment, Survey | —Unverified | 0 |
| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | Jul 17, 2025 | Video Grounding, Video Understanding | —Unverified | 0 |
| Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | Jul 17, 2025 | Multimodal Reasoning, Pose Estimation | —Unverified | 0 |
| LaViPlan : Language-Guided Visual Path Planning with RLVR | Jul 17, 2025 | Autonomous Driving, Vision-Language-Action | —Unverified | 0 |
| Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities | Jul 17, 2025 | Large Language Model, Vision and Language Navigation | —Unverified | 0 |
| AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | Jul 17, 2025 | Vision-Language-Action | —Unverified | 0 |
| LoViC: Efficient Long Video Generation with Context Compression | Jul 17, 2025 | Text-to-Video Generation, Video Generation | —Unverified | 0 |
| MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval | Jul 17, 2025 | Image Retrieval, Re-Ranking | —Unverified | 0 |