Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 226–250 of 571 papers

Title	Date	Tasks	Status
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning3D visual grounding	—Unverified
Interpretable Visual Question Answering via Reasoning Supervision	Sep 7, 2023	Common Sense ReasoningQuestion Answering	—Unverified
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining	Aug 1, 2018	Question AnsweringVisual Grounding	—Unverified
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction	Jun 11, 2018	Question GenerationQuestion-Generation	—Unverified
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement	Oct 1, 2022	Graph Neural NetworkObject	—Unverified
Interactive Reinforcement Learning for Object Grounding via Self-Talking	Dec 2, 2017	Objectreinforcement-learning	—Unverified
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection3D visual grounding	—Unverified
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe	Sep 4, 2019	DisentanglementVisual Grounding	—Unverified
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment	Mar 27, 2019	Image RetrievalPhrase Grounding	—Unverified
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarkingcounterfactual	—Unverified
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Oct 21, 2024	3D visual groundingObject	—Unverified
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe	Jul 17, 2019	DisentanglementVisual Grounding	—Unverified
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms	Jul 1, 2020	DiagnosticObject	—Unverified
Benchmarking Diverse-Modal Entity Linking with Generative Models	May 27, 2023	BenchmarkingDecoder	—Unverified
AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training	Jun 19, 2021	Visual Grounding	—Unverified
Individuation in Neural Models with and without Visual Grounding	Sep 27, 2024	Visual Grounding	—Unverified
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object DetectionAutonomous Driving	—Unverified
Detecting Concrete Visual Tokens for Multimodal Machine Translation	Mar 5, 2024	Machine TranslationMultimodal Machine Translation	—Unverified
Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG	Nov 1, 2018	Text GenerationVisual Grounding	—Unverified
LanguageRefer: Spatial-Language Model for 3D Visual Grounding	Jul 7, 2021	3D visual groundingLanguage Modeling	—Unverified
DSM: Building A Diverse Semantic Map for 3D Visual Grounding	Apr 11, 2025	3D visual groundingScene Understanding	—Unverified
Improving Visually Grounded Sentence Representations with Self-Attention	Dec 2, 2017	SentenceVisual Grounding	—Unverified
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding	Jun 13, 2024	3D visual groundingAttribute	—Unverified
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding	May 8, 2025	3D visual groundingcross-modal alignment	—Unverified
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment	May 3, 2025	SentenceVisual Grounding	—Unverified

Show:10 25 50

← PrevPage 10 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified