Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 521–530 of 571 papers

Title	Date	Tasks	Status
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing	Mar 31, 2025	Objectobject-detection	CodeCode Available
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction	Dec 5, 2024	Relation ExtractionVisual Grounding	CodeCode Available
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI	May 9, 2025	4kDomain Generalization	CodeCode Available
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling	Sep 9, 2024	Language ModelingLanguage Modelling	CodeCode Available
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations	Jun 4, 2023	Visual GroundingWord Embeddings	CodeCode Available
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation	Mar 18, 2025	DecoderObject	CodeCode Available
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering	Sep 13, 2021	Data AugmentationQuestion Answering	CodeCode Available
Learning Two-Branch Neural Networks for Image-Text Matching Tasks	Apr 11, 2017	Image-text matchingRetrieval	CodeCode Available
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding	Jul 27, 2022	Visual Grounding	CodeCode Available
Learning to ground medical text in a 3D human atlas	Nov 1, 2020	Phrase GroundingVisual Grounding	CodeCode Available

Show:10 25 50

← PrevPage 53 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified