Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers



Title	Date	Tasks	Status	Hype	Score
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision	Dec 14, 2021	Contrastive Learning, Representation Learning	Code Available	1	5
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Aug 23, 2024	Language Modeling, Language Modelling	Code Available	1	5
Guessing State Tracking for Visual Dialogue	Feb 24, 2020	Visual Grounding	Code Available	1	5
Fine-Grained Semantically Aligned Vision-Language Pre-Training	Aug 4, 2022	cross-modal alignment, object-detection	Code Available	1	5
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding	Jan 1, 2023	Descriptive, Object	Code Available	1	5
Improving One-stage Visual Grounding by Recursive Sub-query Construction	Aug 3, 2020	Sentence, Sentence Embedding	Code Available	1	5
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model	Jan 21, 2025	Hallucination, Image Captioning	Code Available	1	5
Visual Grounding for Object-Level Generalization in Reinforcement Learning	Aug 4, 2024	Language Modelling, Object	Code Available	1	5
Multi-View Transformer for 3D Visual Grounding	Apr 5, 2022	3D visual grounding, Visual Grounding	Code Available	1	5
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation	Jun 11, 2024	Grounded Multimodal Named Entity Recognition, named-entity-recognition	Code Available	1	5
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection	Feb 3, 2025	3D visual grounding, Visual Grounding	Code Available	1	5
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training	Jan 1, 2023	3D dense captioning, 3D visual grounding	Code Available	1	5
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection	Dec 22, 2023	Attribute, object-detection	Code Available	1	5
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data	Oct 28, 2023	3D visual grounding, Autonomous Vehicles	Code Available	1	5
Context Disentangling and Prototype Inheriting for Robust Visual Grounding	Dec 19, 2023	Visual Grounding	Code Available	1	5
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring	Mar 1, 2021	3D visual grounding, Attribute	Code Available	1	5
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding	Oct 22, 2022	3D visual grounding, Sentence	Code Available	1	5
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning	Feb 1, 2025	Referring Expression, Visual Grounding	Code Available	1	5
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning	Mar 29, 2025	Chart Question Answering, Chart Understanding	Code Available	1	5
Multi-Modal Dynamic Graph Transformer for Visual Grounding	Jan 1, 2022	Visual Grounding	Code Available	1	5
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment	Aug 29, 2022	cross-modal alignment, Image-text Retrieval	Code Available	1	5
Grounded Situation Recognition with Transformers	Nov 19, 2021	Decoder, Grounded Situation Recognition	Code Available	1	5
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision	Jul 23, 2023	Decoder, Visual Grounding	Code Available	1	5
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding	Mar 24, 2021	Object, Relation	Code Available	1	5
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation	Sep 17, 2021	Dialogue Generation, Visual Grounding	Code Available	1	5


Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
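
The tables above report two metrics, Accuracy (%) and IoU, without stating how they are computed. The sketch below shows one common way box-level grounding is scored: a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds a threshold. The 0.5 threshold, the corner box format (x1, y1, x2, y2), and the function names box_iou and grounding_accuracy are assumptions made for illustration. This page does not confirm that every listed entry was evaluated with this exact protocol.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not stated on this page
) -> float:
    """Percentage of predictions whose IoU with the target exceeds threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) > threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy example the first prediction matches its target exactly and the second overlaps its target only slightly, so one of the two predictions passes the threshold.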