Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (each is illustrated in the sketch after this list):

  • What is the main focus of the query?
  • How to understand the image?
  • How to locate the target object?
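The following is a minimal, self-contained sketch of the classic two-stage formulation of these three steps: encode the query (query understanding), encode candidate regions (image understanding), and pick the best-matching box (localization). It is not the method of any paper listed below; the function names are hypothetical, and the encoders are random placeholders standing in for real text and vision towers.

```python
# Toy two-stage visual grounding pipeline (hypothetical interface).
import numpy as np

def encode_text(query: str, dim: int = 512) -> np.ndarray:
    """Placeholder text encoder. A real system would use a language
    tower (e.g. CLIP-style); here the query is hashed to a random vector."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def encode_region(image: np.ndarray, box: tuple, dim: int = 512) -> np.ndarray:
    """Placeholder region encoder. A real system would run a visual
    backbone over the cropped proposal; here the crop seeds a random vector."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    rng = np.random.default_rng(crop.size)  # deterministic stand-in
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def ground(image: np.ndarray, query: str, proposals: list) -> tuple:
    """Score every (query, region) pair by cosine similarity and
    return the highest-scoring proposal as the grounded box."""
    q = encode_text(query)
    scores = [float(q @ encode_region(image, box)) for box in proposals]
    return proposals[int(np.argmax(scores))]

image = np.zeros((480, 640, 3), dtype=np.uint8)         # dummy image
proposals = [(10, 10, 100, 100), (200, 150, 400, 300)]  # dummy region proposals
print(ground(image, "the dog on the left", proposals))
```

One-stage methods (see e.g. "Look Before You Leap" below) instead regress the box directly from fused image-text features, skipping the explicit proposal stage.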

Papers

Showing 126–150 of 571 papers (page 6 of 23)

Title | Status | Hype
SAT: 2D Semantics Assisted Training for 3D Visual Grounding | Code | 1
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Code | 1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1
Connecting What to Say With Where to Look by Modeling Human Attention Traces | Code | 1
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Code | 1
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Code | 1
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Code | 1
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding | Code | 1
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection | Code | 1
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
Grounded Situation Recognition with Transformers | Code | 1
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Code | 1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
Collaborative Transformers for Grounded Situation Recognition | Code | 1
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1
Guessing State Tracking for Visual Dialogue | Code | 1
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding | Code | 1
Mono3DVG: 3D Visual Grounding in Monocular Images | Code | 1
Multi3DRefer: Grounding Text Description to Multiple 3D Objects | Code | 1
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models | Code | 1
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding | Code | 1
Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding | Code | 1
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | - | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | - | Unverified
7 | HYDRA | IoU | 61.7 | - | Unverified
8 | HYDRA | IoU | 61.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | - | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | - | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | - | Unverified
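The tables above report Accuracy (%) and IoU. For reference, visual grounding accuracy is conventionally the fraction of queries whose predicted box overlaps the ground-truth box with IoU at or above a threshold; 0.5 is the common choice, though this page does not state which threshold each benchmark uses, so treat it as an assumption in the sketch below.

```python
# Minimal sketch of the usual VG evaluation: a prediction counts as
# correct when its IoU with the ground-truth box reaches a threshold
# (0.5 assumed here; the benchmark tables do not state the threshold).
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def accuracy(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth >= thresh."""
    return sum(iou(p, g) >= thresh for p, g in zip(preds, gts)) / len(preds)

print(accuracy([(0, 0, 10, 10)], [(2, 2, 10, 10)]))  # IoU = 64/100 -> 1.0
```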