Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 326–350 of 571 papers

Title	Date	Tasks	Status
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners	Apr 30, 2024	3D visual groundingVisual Grounding	—Unverified
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image CaptioningObject Detection	—Unverified
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	Oct 21, 2024	Instruction Followingobject-detection	—Unverified
Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics	Oct 10, 2024	Visual Grounding	—Unverified
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations	Feb 2, 2024	Contrastive LearningObject	—Unverified
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding	Mar 8, 2025	Language ModelingLanguage Modelling	—Unverified
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding	Mar 19, 2020	ObjectReferring Expression Comprehension	—Unverified
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual groundingAutonomous Navigation	—Unverified
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment	Mar 27, 2019	Image RetrievalPhrase Grounding	—Unverified
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving	Mar 28, 2025	3D visual groundingAutonomous Driving	—Unverified
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection	Sep 18, 2023	3D Object Detection3D Open-Vocabulary Object Detection	—Unverified
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image CaptioningLanguage Modeling	—Unverified
OG: Equip vision occupancy with instance segmentation and visual grounding	Jul 12, 2023	Instance SegmentationSegmentation	—Unverified
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web	Feb 27, 2024	Language ModelingLanguage Modelling	—Unverified
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding	Jan 1, 2024	Scene UnderstandingVisual Grounding	—Unverified
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning	Jun 22, 2025	Answer GenerationDecision Making	—Unverified
On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval	Apr 24, 2019	RetrievalVisual Grounding	—Unverified
On the Role of Visual Grounding in VQA	Jun 26, 2024	Visual GroundingVisual Question Answering (VQA)	—Unverified
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene UnderstandingSemantic Segmentation	—Unverified
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models	Jul 18, 2024	3D Semantic SegmentationSemantic Segmentation	—Unverified
OptiBox: Breaking the Limits of Proposals for Visual Grounding	Nov 29, 2019	Image CaptioningVisual Grounding	—Unverified
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image GenerationImage-text Retrieval	—Unverified
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization	Oct 8, 2018	Question AnsweringVisual Grounding	—Unverified
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding	Jan 1, 2024	3D visual groundingVisual Grounding	—Unverified
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding	Dec 1, 2024	Visual Grounding	—Unverified

Show:10 25 50

← PrevPage 14 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified