Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How should the object be located?

Papers

Showing 201–225 of 571 papers

Title	Date	Tasks	Status	Hype
Local-Global Context Aware Transformer for Language-Guided Video Segmentation	Mar 18, 2022	Referring Expression Segmentation, Referring Video Object Segmentation	Code Available	1
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding	Nov 25, 2022	3D visual grounding, Knowledge Distillation	Code Available	1
Mask Grounding for Referring Image Segmentation	Dec 19, 2023	cross-modal alignment, Image Segmentation	Code Available	1
Multi-Modal Dynamic Graph Transformer for Visual Grounding	Jan 1, 2022	Visual Grounding	Code Available	1
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models	Jan 6, 2025	Hallucination, Visual Grounding	Unverified	0
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding	Sep 28, 2022	Decoder, Visual Grounding	Unverified	0
Dynamic Inference With Grounding Based Vision and Language Models	Jan 1, 2023	Language Modelling, Referring Expression	Unverified	0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding	Jun 13, 2024	3D visual grounding, Attribute	Unverified	0
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation	Dec 29, 2023	Visual Grounding	Unverified	0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual grounding, Autonomous Navigation	Unverified	0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding	Apr 11, 2025	3D visual grounding, Scene Understanding	Unverified	0
ACTRESS: Active Retraining for Semi-supervised Visual Grounding	Jul 3, 2024	Binary Classification, Visual Grounding	Unverified	0
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models	Apr 26, 2024	Game Design, Image Generation	Unverified	0
Data-Efficient 3D Visual Grounding via Order-Aware Referring	Mar 25, 2024	3D visual grounding, Object	Unverified	0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation	May 24, 2025	Mathematical Reasoning, Multimodal Reasoning	Unverified	0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Jun 5, 2025	cross-modal alignment, Dense Captioning	Unverified	0
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding	Mar 25, 2025	Attribute, Object	Unverified	0
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis	Oct 31, 2023	Descriptive, Medical Image Analysis	Unverified	0
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement	Oct 1, 2022	Graph Neural Network, Object	Unverified	0
Interactive Reinforcement Learning for Object Grounding via Self-Talking	Dec 2, 2017	Object, reinforcement-learning	Unverified	0
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection, 3D visual grounding	Unverified	0
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe	Sep 4, 2019	Disentanglement, Visual Grounding	Unverified	0
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment	Mar 27, 2019	Image Retrieval, Phrase Grounding	Unverified	0
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning, 3D visual grounding	Unverified	0
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language Modeling, Language Modelling	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
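
The Metric column above reports Accuracy (%) and IoU. This page does not say how either is computed. The sketch below assumes the common convention for box-based grounding: a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches a threshold, usually 0.5. The names `box_iou` and `grounding_accuracy`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are illustrative assumptions, not taken from any listed paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

Under this assumption, a query contributes to Accuracy (%) only as a hit or a miss. An IoU entry, by contrast, would normally be an overlap score averaged over queries, so the two metrics are not directly comparable across rows.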