Visual Intention Grounding for Egocentric Assistants | Apr 18, 2025 | Object; Visual Grounding | Unverified 0
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts | Apr 14, 2025 | Benchmarking; Object | Unverified 0
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding | Apr 13, 2025 | 3D visual grounding; Data Augmentation | Code Available 0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding | Apr 11, 2025 | 3D visual grounding; Scene Understanding | Unverified 0
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations | Apr 10, 2025 | Spatial Reasoning; Visual Grounding | Unverified 0
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Apr 10, 2025 | Language Modeling; Language Modelling | Code Available 9
Towards Visual Text Grounding of Multimodal Large Language Model | Apr 7, 2025 | Benchmarking; Language Modeling | Unverified 0
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Apr 3, 2025 | Instruction Following; Language Modeling | Code Available 1
Multimodal Reference Visual Grounding | Apr 2, 2025 | Few-Shot Object Detection; Visual Grounding | Unverified 0
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Apr 2, 2025 | Descriptive; Large Language Model | Code Available 0
Image Difference Grounding with Natural Language | Apr 2, 2025 | Visual Grounding | Unverified 0
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing | Mar 31, 2025 | Object; object-detection | Code Available 0
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning | Mar 30, 2025 | 3D visual grounding; Feature Splatting | Unverified 0
Efficient Adaptation For Remote Sensing Visual Grounding | Mar 29, 2025 | parameter-efficient fine-tuning; Visual Grounding | Unverified 0
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning | Mar 29, 2025 | Chart Question Answering; Chart Understanding | Code Available 1
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving | Mar 28, 2025 | 3D visual grounding; Autonomous Driving | Unverified 0
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | Mar 25, 2025 | Attribute; Object | Unverified 0
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes | Mar 24, 2025 | Cross-Modal Retrieval; Disentanglement | Unverified 0
A Vision Centric Remote Sensing Benchmark | Mar 20, 2025 | Question Answering; Representation Learning | Unverified 0
Visual Position Prompt for MLLM based Visual Grounding | Mar 19, 2025 | Position; Visual Grounding | Code Available 1
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation | Mar 18, 2025 | Decoder; Object | Code Available 0
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model | Mar 17, 2025 | Image Segmentation; Segmentation | Code Available 2
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding | Mar 17, 2025 | Domain Generalization; Multimodal Reasoning | Code Available 2
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning; Question Answering | Code Available 1
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making; Interactive Segmentation | Code Available 2
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | Mar 8, 2025 | Language Modeling; Language Modelling | Unverified 0
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions | Mar 5, 2025 | Anomaly Detection; Visual Grounding | Unverified 0
Teaching Metric Distance to Autoregressive Multimodal Foundational Models | Mar 4, 2025 | Image Generation; Visual Grounding | Unverified 0
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning | Feb 28, 2025 | Task Planning; Visual Grounding | Unverified 0
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding | Feb 26, 2025 | 3D visual grounding; Visual Grounding | Unverified 0
Programming with Pixels: Computer-Use Meets Software Engineering | Feb 24, 2025 | Visual Grounding | Unverified 0
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding | Feb 24, 2025 | cross-modal alignment; Visual Grounding | Code Available 1
GroundCap: A Visually Grounded Image Captioning Dataset | Feb 19, 2025 | Image Captioning; Object Detection | Unverified 0
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | Feb 16, 2025 | Instance Segmentation; Language Modeling | Unverified 0
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding | Feb 14, 2025 | 3D Object Detection; 3D visual grounding | Code Available 3
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation | Feb 11, 2025 | Retrieval; Vision and Language Navigation | Unverified 0
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection | Feb 3, 2025 | 3D visual grounding; Visual Grounding | Code Available 1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Feb 1, 2025 | Referring Expression; Visual Grounding | Code Available 1
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception | Jan 31, 2025 | Reinforcement Learning (RL); Spatial Reasoning | Unverified 0
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | Jan 24, 2025 | Decoder; Object | Unverified 0
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Jan 21, 2025 | Hallucination; Image Captioning | Code Available 1
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Jan 17, 2025 | Large Language Model; Multimodal Large Language Model | Code Available 1
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | Jan 17, 2025 | Bayesian Inference; Language Modeling | Unverified 0
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring | Jan 16, 2025 | 3D visual grounding; Decoder | Unverified 0
A Simple Aerial Detection Baseline of Multimodal Language Models | Jan 16, 2025 | object-detection; Object Detection | Code Available 2
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Jan 12, 2025 | Image Segmentation; Referring Expression | Code Available 1
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning; Language Modeling | Unverified 0
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Jan 11, 2025 | Math; Mathematical Problem-Solving | Code Available 1
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | Jan 8, 2025 | Math; Mathematical Reasoning | Code Available 2