Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Title	Date	Tasks	Status
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling	Feb 9, 2024	Hallucination, Natural Language Understanding	Code Available
Introspective Learning : A Two-Stage Approach for Inference in Neural Networks	Sep 17, 2022	Active Learning, Decision Making	Code Available
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks	Aug 24, 2023	Language Modeling, Language Modelling	Code Available
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models	Sep 16, 2024	Attribute, Decoder	Code Available
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation	Jul 12, 2023	Lifelong learning, Object Detection	Code Available
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models	Nov 21, 2023	Image Segmentation, Language Modelling	Code Available
Cost-Effective Language Driven Image Editing with LX-DRIM	Oct 1, 2022	Visual Grounding	Code Available
Beyond Human Perception: Understanding Multi-Object World from Monocular View	Jan 1, 2025	3D visual grounding, Denoising	Code Available
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo	Mar 30, 2022	Benchmarking, Person-centric Visual Grounding	Code Available
To Find Waldo You Need Contextual Cues: Debiasing Who’s Waldo	May 1, 2022	Benchmarking, Person-centric Visual Grounding	Code Available
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks	Jan 12, 2023	Cross-Modal Retrieval, Open-Ended Question Answering	Code Available
Cosine meets Softmax: A tough-to-beat baseline for visual grounding	Sep 13, 2020	Autonomous Driving, Metric Learning	Code Available
Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency	Jan 1, 2024	3D visual grounding, Relation	Code Available
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics	May 24, 2023	Image Captioning, Negation	Code Available
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes	Nov 22, 2015	Common Sense Reasoning, Image Retrieval	Code Available
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities	Apr 2, 2025	Descriptive, Large Language Model	Code Available
Grounding of Textual Phrases in Images by Reconstruction	Nov 12, 2015	Language Modeling, Language Modelling	Code Available
GROOViST: A Metric for Grounding Objects in Visual Storytelling	Oct 26, 2023	Visual Grounding, Visual Storytelling	Code Available
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset	Nov 21, 2024	Question Answering, Visual Grounding	Code Available
Visual Coreference Resolution in Visual Dialog using Neural Module Networks	Sep 6, 2018	Common Sense Reasoning, coreference-resolution	Code Available
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models	Dec 3, 2023	Hallucination, Visual Grounding	Code Available

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
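
The tables above report "Accuracy (%)" and "IoU" without saying how either is computed. As a rough illustration only, the sketch below assumes a convention that is common on RefCOCO-style benchmarks: each query has one predicted box and one ground-truth box, and a prediction counts as correct when their intersection over union (IoU) is at least 0.5. The corner box format, the function names, and the 0.5 threshold are assumptions made for this example and are not taken from any of the listed papers.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of queries whose predicted box reaches the IoU threshold.

    The 0.5 default is an assumed convention, not a value stated in the tables.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

Under this assumed convention a single leaderboard number hides how tight the boxes are: a prediction with IoU 0.51 and one with IoU 0.99 both count as a hit, which is one reason some entries report mean IoU instead.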