Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 471–480 of 571 papers

Title	Date	Tasks	Status
Towards Visual Text Grounding of Multimodal Large Language Model	Apr 7, 2025	BenchmarkingLanguage Modeling	—Unverified
Training-Free Reasoning and Reflection in MLLMs	May 22, 2025	DecoderMultimodal Reasoning	—Unverified
Transfer Learning from Audio-Visual Grounding to Speech Recognition	Jul 9, 2019	speech-recognitionSpeech Recognition	—Unverified
Transformers in Vision: A Survey	Jan 4, 2021	Action RecognitionActivity Recognition	—Unverified
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding	Aug 5, 2021	3D visual groundingRelation	—Unverified
Compositional Temporal Visual Grounding of Natural Language Event Descriptions	Dec 4, 2019	Visual Grounding	—Unverified
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary	Sep 18, 2020	Autonomous VehiclesReferring Expression Comprehension	—Unverified
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	Feb 11, 2025	RetrievalVision and Language Navigation	—Unverified
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding	May 19, 2023	SentenceVisual Grounding	—Unverified
Class-agnostic Object Detection	Nov 28, 2020	BenchmarkingClass-agnostic Object Detection	—Unverified

Show:10 25 50

← PrevPage 48 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified