Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 76–100 of 571 papers

Title	Date	Tasks	Status	Hype
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories	Mar 11, 2025	Decision MakingInteractive Segmentation	CodeCode Available	2
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding	Mar 8, 2025	Language ModelingLanguage Modelling	—Unverified	0
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions	Mar 5, 2025	Anomaly DetectionVisual Grounding	—Unverified	0
Teaching Metric Distance to Autoregressive Multimodal Foundational Models	Mar 4, 2025	Image GenerationVisual Grounding	—Unverified	0
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning	Feb 28, 2025	Task PlanningVisual Grounding	—Unverified	0
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding	Feb 26, 2025	3D visual groundingVisual Grounding	—Unverified	0
Programming with Pixels: Computer-Use Meets Software Engineering	Feb 24, 2025	Visual Grounding	—Unverified	0
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding	Feb 24, 2025	cross-modal alignmentVisual Grounding	CodeCode Available	1
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image CaptioningObject Detection	—Unverified	0
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring	Feb 16, 2025	Instance SegmentationLanguage Modeling	—Unverified	0
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding	Feb 14, 2025	3D Object Detection3D visual grounding	CodeCode Available	3
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	Feb 11, 2025	RetrievalVision and Language Navigation	—Unverified	0
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection	Feb 3, 2025	3D visual groundingVisual Grounding	CodeCode Available	1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning	Feb 1, 2025	Referring ExpressionVisual Grounding	CodeCode Available	1
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception	Jan 31, 2025	Reinforcement Learning (RL)Spatial Reasoning	—Unverified	0
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations	Jan 24, 2025	DecoderObject	—Unverified	0
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model	Jan 21, 2025	HallucinationImage Captioning	CodeCode Available	1
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis	Jan 17, 2025	Large Language ModelMultimodal Large Language Model	CodeCode Available	1
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis	Jan 17, 2025	Bayesian InferenceLanguage Modeling	—Unverified	0
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring	Jan 16, 2025	3D visual groundingDecoder	—Unverified	0
A Simple Aerial Detection Baseline of Multimodal Language Models	Jan 16, 2025	object-detectionObject Detection	CodeCode Available	2
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints	Jan 12, 2025	Image SegmentationReferring Expression	CodeCode Available	1
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image CaptioningLanguage Modeling	—Unverified	0
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs	Jan 11, 2025	MathMathematical Problem-Solving	CodeCode Available	1
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics	Jan 8, 2025	MathMathematical Reasoning	CodeCode Available	2

Show:10 25 50

← PrevPage 4 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified