ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition | Jul 15, 2025 | 3D visual grounding, Visual Grounding
Unverified 0 | VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | Jul 9, 2025 | Backdoor Attack, Visual Grounding
Unverified 0 | A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | Jul 9, 2025 | 3D visual grounding, Autonomous Navigation
Unverified 0 | High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Jul 8, 2025 | MME, Reinforcement Learning (RL)
Code Available 2 | GTA1: GUI Test-time Scaling Agent | Jul 8, 2025 | Reinforcement Learning (RL), Task Planning
Code Available 2 | DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Jun 30, 2025 | Caption Generation, Object
Code Available 2 | SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding | Jun 27, 2025 | 3D visual grounding, Natural Language Queries
Unverified 0 | HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | Jun 26, 2025 | counterfactual, Counterfactual Reasoning
Unverified 0 | GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding | Jun 26, 2025 | 3D visual grounding, Large Language Model
Unverified 0 | DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding, Optical Character Recognition (OCR)
Code Available 0 | GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | Jun 22, 2025 | Answer Generation, Decision Making
Unverified 0 | I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs | Jun 17, 2025 | 3D visual grounding, Contrastive Learning
Unverified 0 | Unified Representation Space for 3D Visual Grounding | Jun 17, 2025 | 3D visual grounding, Contrastive Learning
Unverified 0 | Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation | Jun 12, 2025 | Image Segmentation, Segmentation
Unverified 0 | Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs | Jun 11, 2025 | Hallucination, Object Hallucination
Code Available 1 | EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | Jun 9, 2025 | Benchmarking, Navigate
Unverified 0 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Jun 5, 2025 | cross-modal alignment, Dense Captioning
Unverified 0 | Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning | Jun 5, 2025 | Math, Visual Grounding
Unverified 0 | From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Jun 5, 2025 | 3D visual grounding, Object
Unverified 0 | RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought | Jun 4, 2025 | Multimodal Reasoning, Reasoning Segmentation
Unverified 0 | GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents | Jun 3, 2025 | Visual Grounding
Unverified 0 | MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs | Jun 2, 2025 | Instruction Following, Text Generation
Unverified 0 | D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding | May 30, 2025 | Diversity, Pseudo Label
Unverified 0 | mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | May 29, 2025 | Question Answering, RAG
Unverified 0 | Zero-Shot 3D Visual Grounding from Vision-Language Models | May 28, 2025 | 3D visual grounding, Visual Grounding
Unverified 0 | Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | May 27, 2025 | Hallucination, Visual Grounding
Unverified 0 | Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model | May 26, 2025 | Diagnostic, Reinforcement Learning (RL)
Code Available 0 | Two Causally Related Needles in a Video Haystack | May 26, 2025 | Video Understanding, Visual Grounding
Unverified 0 | Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | May 24, 2025 | Mathematical Reasoning, Multimodal Reasoning
Unverified 0 | CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | May 23, 2025 | Diagnostic, Question Answering
Code Available 0 | More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models | May 23, 2025 | Diagnostic, Hallucination
Unverified 0 | OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics | May 23, 2025 | Chart Understanding, object-detection
Code Available 3 | Training-Free Reasoning and Reflection in MLLMs | May 22, 2025 | Decoder, Multimodal Reasoning
Unverified 0 | Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics | May 22, 2025 | Image Captioning, text similarity
Unverified 0 | GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents | May 21, 2025 | Answer Generation, Reinforcement Learning (RL)
Code Available 1 | Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding | May 21, 2025 | Visual Grounding
Unverified 0 | InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | May 21, 2025 | Earth Observation, Object
Code Available 2 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | May 20, 2025 | Large Language Model, Multimodal Large Language Model
Unverified 0 | Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning | May 18, 2025 | Reinforcement Learning (RL), Visual Grounding
Code Available 3 | MedSG-Bench: A Benchmark for Medical Image Sequences Grounding | May 17, 2025 | Visual Grounding, Visual Question Answering (VQA)
Unverified 0 | TinyRS-R1: Compact Multimodal Language Model for Remote Sensing | May 17, 2025 | Language Modeling, Language Modelling
Unverified 0 | UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | May 17, 2025 | Image to text, Information Retrieval
Code Available 0 | HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics
Code Available 0 | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | May 13, 2025 | 3D visual grounding, Autonomous Driving
Code Available 1 | Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | May 9, 2025 | 4k, Domain Generalization
Code Available 0 | DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding | May 8, 2025 | 3D visual grounding, cross-modal alignment
Unverified 0 | AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding | May 7, 2025 | 3D visual grounding, Graph Attention
Code Available 0 | 3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment | May 3, 2025 | Sentence, Visual Grounding
Unverified 0 | VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? | Apr 27, 2025 | Visual Grounding, Visual Storytelling
Unverified 0 | Revisiting Data Auditing in Large Vision-Language Models | Apr 25, 2025 | Visual Grounding
Unverified 0