Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How should the target object be located?

Papers


Showing 1–50 of 571 papers

Title	Date	Tasks	Status	Hype
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition	Jul 15, 2025	3D visual grounding, Visual Grounding	Unverified	0
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation	Jul 9, 2025	Backdoor Attack, Visual Grounding	Unverified	0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual grounding, Autonomous Navigation	Unverified	0
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning	Jul 8, 2025	MME, Reinforcement Learning (RL)	Code Available	2
GTA1: GUI Test-time Scaling Agent	Jul 8, 2025	Reinforcement Learning (RL), Task Planning	Code Available	2
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World	Jun 30, 2025	Caption Generation, Object	Code Available	2
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding	Jun 27, 2025	3D visual grounding, Natural Language Queries	Unverified	0
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation	Jun 26, 2025	counterfactual, Counterfactual Reasoning	Unverified	0
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding	Jun 26, 2025	3D visual grounding, Large Language Model	Unverified	0
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images	Jun 26, 2025	document understanding, Optical Character Recognition (OCR)	Code Available	0
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning	Jun 22, 2025	Answer Generation, Decision Making	Unverified	0
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs	Jun 17, 2025	3D visual grounding, Contrastive Learning	Unverified	0
Unified Representation Space for 3D Visual Grounding	Jun 17, 2025	3D visual grounding, Contrastive Learning	Unverified	0
Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation	Jun 12, 2025	Image Segmentation, Segmentation	Unverified	0
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs	Jun 11, 2025	Hallucination, Object Hallucination	Code Available	1
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments	Jun 9, 2025	Benchmarking, Navigate	Unverified	0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Jun 5, 2025	cross-modal alignment, Dense Captioning	Unverified	0
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning	Jun 5, 2025	Math, Visual Grounding	Unverified	0
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes	Jun 5, 2025	3D visual grounding, Object	Unverified	0
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought	Jun 4, 2025	Multimodal Reasoning, Reasoning Segmentation	Unverified	0
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents	Jun 3, 2025	Visual Grounding	Unverified	0
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction Following, Text Generation	Unverified	0
D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding	May 30, 2025	Diversity, Pseudo Label	Unverified	0
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation	May 29, 2025	Question Answering, RAG	Unverified	0
Zero-Shot 3D Visual Grounding from Vision-Language Models	May 28, 2025	3D visual grounding, Visual Grounding	Unverified	0
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	May 27, 2025	Hallucination, Visual Grounding	Unverified	0
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model	May 26, 2025	Diagnostic, Reinforcement Learning (RL)	Code Available	0
Two Causally Related Needles in a Video Haystack	May 26, 2025	Video Understanding, Visual Grounding	Unverified	0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation	May 24, 2025	Mathematical Reasoning, Multimodal Reasoning	Unverified	0
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays	May 23, 2025	Diagnostic, Question Answering	Code Available	0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	Diagnostic, Hallucination	Unverified	0
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics	May 23, 2025	Chart Understanding, object-detection	Code Available	3
Training-Free Reasoning and Reflection in MLLMs	May 22, 2025	Decoder, Multimodal Reasoning	Unverified	0
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics	May 22, 2025	Image Captioning, text similarity	Unverified	0
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents	May 21, 2025	Answer Generation, Reinforcement Learning (RL)	Code Available	1
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding	May 21, 2025	Visual Grounding	Unverified	0
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition	May 21, 2025	Earth Observation, Object	Code Available	2
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning	May 20, 2025	Large Language Model, Multimodal Large Language Model	Unverified	0
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning	May 18, 2025	Reinforcement Learning (RL), Visual Grounding	Code Available	3
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding	May 17, 2025	Visual Grounding, Visual Question Answering (VQA)	Unverified	0
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing	May 17, 2025	Language Modeling, Language Modelling	Unverified	0
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings	May 17, 2025	Image to text, Information Retrieval	Code Available	0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	Benchmarking, Ethics	Code Available	0
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	May 13, 2025	3D visual grounding, Autonomous Driving	Code Available	1
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI	May 9, 2025	4k, Domain Generalization	Code Available	0
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding	May 8, 2025	3D visual grounding, cross-modal alignment	Unverified	0
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding	May 7, 2025	3D visual grounding, Graph Attention	Code Available	0
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment	May 3, 2025	Sentence, Visual Grounding	Unverified	0
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?	Apr 27, 2025	Visual Grounding, Visual Storytelling	Unverified	0
Revisiting Data Auditing in Large Vision-Language Models	Apr 25, 2025	Visual Grounding	Unverified	0


Benchmark Results

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
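
The tables above report two metrics, Accuracy (%) and IoU, without saying how either is computed. The sketch below shows one common way to score box-level grounding, under assumptions the page does not confirm: boxes are given as (x1, y1, x2, y2) corner coordinates, and a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. The names `box_iou` and `grounding_accuracy` are illustrative and are not taken from any listed model's code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold; not stated on the page
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy run the first prediction matches its target exactly, while the second overlaps its target only slightly and falls below the threshold, so half of the queries are counted as correctly grounded. Individual benchmarks may use a different threshold or a mask-based IoU, so claimed numbers should be compared only under the same protocol.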