SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • Identifying the main focus of the query.
  • Understanding the image content.
  • Localizing the referred object or region.
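Grounding predictions are conventionally scored by the overlap between the predicted box and the ground-truth box, measured as Intersection-over-Union (IoU) — the same metric that appears in the benchmark tables below. A minimal sketch (the function name and box format are illustrative, not taken from any benchmark code):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two unit-offset 2×2 boxes such as `(0, 0, 2, 2)` and `(1, 1, 3, 3)` share a 1×1 intersection, giving an IoU of 1/7.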

Papers

Showing 301–325 of 571 papers

  • Improved Visual Grounding through Self-Consistent Explanations (Hype: 0)
  • Improving Visually Grounded Sentence Representations with Self-Attention (Hype: 0)
  • Individuation in Neural Models with and without Visual Grounding (Hype: 0)
  • Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention (Hype: 0)
  • Interactive Reinforcement Learning for Object Grounding via Self-Talking (Hype: 0)
  • Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction (Hype: 0)
  • Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining (Hype: 0)
  • Interpretable Visual Question Answering via Reasoning Supervision (Hype: 0)
  • INVIGORATE: Interactive Visual Grounding and Grasping in Clutter (Hype: 0)
  • I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs (Hype: 0)
  • Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding (Hype: 0)
  • Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms (Hype: 0)
  • Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving (Hype: 0)
  • Language learning using Speech to Image retrieval (Hype: 0)
  • LanguageRefer: Spatial-Language Model for 3D Visual Grounding (Hype: 0)
  • LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering (Hype: 0)
  • Learning from Synthetic Data for Visual Grounding (Hype: 0)
  • Visually Consistent Hierarchical Image Classification (Hype: 0)
  • Learning Language Structures through Grounding (Hype: 0)
  • Learning to Compose and Reason with Language Tree Structures for Visual Grounding (Hype: 0)
  • Learning to Ground VLMs without Forgetting (Hype: 0)
  • Learning Unsupervised Visual Grounding Through Semantic Self-Supervision (Hype: 0)
  • Learning Visual Grounding from Generative Vision and Language Model (Hype: 0)
  • Learning with Difference Attention for Visually Grounded Self-supervised Representations (Hype: 0)
  • Less is More: Generating Grounded Navigation Instructions from Landmarks (Hype: 0)

Benchmark Results

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  95.3     –         Unverified
2  mPLUG-2              Accuracy (%)  92.8     –         Unverified
3  X2-VLM (large)       Accuracy (%)  92.1     –         Unverified
4  XFM (base)           Accuracy (%)  90.4     –         Unverified
5  X2-VLM (base)        Accuracy (%)  90.3     –         Unverified
6  X-VLM (base)         Accuracy (%)  89       –         Unverified
7  HYDRA                IoU           61.7     –         Unverified
8  HYDRA                IoU           61.1     –         Unverified

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  92       –         Unverified
2  mPLUG-2              Accuracy (%)  86.05    –         Unverified
3  X2-VLM (large)       Accuracy (%)  81.8     –         Unverified
4  XFM (base)           Accuracy (%)  79.8     –         Unverified
5  X2-VLM (base)        Accuracy (%)  78.4     –         Unverified
6  X-VLM (base)         Accuracy (%)  76.91    –         Unverified

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  93.4     –         Unverified
2  mPLUG-2              Accuracy (%)  90.33    –         Unverified
3  X2-VLM (large)       Accuracy (%)  87.6     –         Unverified
4  XFM (base)           Accuracy (%)  86.1     –         Unverified
5  X2-VLM (base)        Accuracy (%)  85.2     –         Unverified
6  X-VLM (base)         Accuracy (%)  84.51    –         Unverified
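An "Accuracy (%)" figure on visual grounding benchmarks is typically computed as the fraction of queries whose predicted box overlaps the ground-truth box with IoU at or above a threshold, commonly 0.5 (Acc@0.5). A minimal, self-contained sketch of that aggregation — function names and box tuples are illustrative assumptions, not the benchmarks' actual evaluation code:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_thresh(preds, gts, thresh=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

For instance, with one exact match and one complete miss out of two queries, `acc_at_thresh` returns 50.0. The IoU rows in the first table above report the overlap value itself rather than a thresholded hit rate, so the two metrics are not directly comparable.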