Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 281–290 of 571 papers

Title	Date	Tasks	Status
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models	Jun 28, 2024	DiversityRetrieval	—Unverified
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes	Jun 5, 2025	3D visual groundingObject	—Unverified
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding	Jan 1, 2024	3D visual groundingVisual Grounding	—Unverified
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image GenerationImage-text Retrieval	—Unverified
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene UnderstandingSemantic Segmentation	—Unverified
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning	Jun 22, 2025	Answer GenerationDecision Making	—Unverified
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image CaptioningLanguage Modeling	—Unverified
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding	Mar 19, 2020	ObjectReferring Expression Comprehension	—Unverified
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	Oct 21, 2024	Instruction Followingobject-detection	—Unverified
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image CaptioningObject Detection	—Unverified

Show:10 25 50

← PrevPage 29 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified