Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 271–280 of 571 papers

Title	Date	Tasks	Status	Score
Composing Pick-and-Place Tasks By Grounding Language	Feb 16, 2021	Natural Language Visual GroundingRobotic Grasping	CodeCode Available	5
Finding beans in burgers: Deep semantic-visual embedding with localization	Apr 5, 2018	Cross-Modal RetrievalImage Captioning	CodeCode Available	5
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition	Jul 5, 2024	Visual GroundingVisual Storytelling	CodeCode Available	5
Collecting Visually-Grounded Dialogue with A Game Of Sorts	Sep 10, 2023	Coreference ResolutionImage Retrieval	CodeCode Available	5
Few-Shot Multimodal Explanation for Visual Question Answering	Oct 28, 2024	Explainable artificial intelligenceExplainable Artificial Intelligence (XAI)	CodeCode Available	5
Neural Twins Talk	Sep 26, 2020	Image CaptioningSentence	CodeCode Available	5
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding	May 7, 2025	3D visual groundingGraph Attention	CodeCode Available	5
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model	Jul 7, 2024	SegmentationSentence	CodeCode Available	5
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models	Dec 22, 2024	Language ModelingLanguage Modelling	CodeCode Available	5
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning	Oct 17, 2023	SegmentationVisual Grounding	CodeCode Available	5

Show:10 25 50

← PrevPage 28 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified