Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 441–450 of 571 papers

Title	Date	Tasks	Status
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding	Jan 2, 2025	3D visual groundingDiagnostic	—Unverified
VIMI: Grounding Video Generation through Multi-modal Instruction	Jul 8, 2024	Text-to-Video GenerationVideo Generation	—Unverified
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding	May 18, 2023	Contrastive LearningObject	—Unverified
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?	Apr 27, 2025	Visual GroundingVisual Storytelling	—Unverified
Visual Grounding Annotation of Recipe Flow Graph	May 1, 2020	Visual Grounding	—Unverified
Visual grounding for desktop graphical user interfaces	May 5, 2024	Language ModelingLanguage Modelling	—Unverified
How direct is the link between words and images?	Jun 30, 2022	Visual GroundingWord Embeddings	—Unverified
Visual Grounding of Inter-lingual Word-Embeddings	Sep 8, 2022	Visual GroundingWord Embeddings	—Unverified
Visual Grounding of Whole Radiology Reports for 3D CT Images	Dec 8, 2023	SegmentationVisual Grounding	—Unverified
Visual Grounding Strategies for Text-Only Natural Language Processing	Mar 25, 2021	Image RetrievalLanguage Modeling	—Unverified

Show:10 25 50

← PrevPage 45 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified