Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How should the target object be located?

Papers

Showing 101–125 of 571 papers

Title	Date	Tasks	Status	Hype	Score
A Unified Framework for 3D Point Cloud Visual Grounding	Aug 23, 2023	CPU, GPU	Code Available	1	5
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans	May 23, 2023	3D Reconstruction, 3D visual grounding	Code Available	1	5
Deep Multimodal Neural Architecture Search	Apr 25, 2020	Decoder, Image-text matching	Code Available	1	5
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models	Sep 24, 2021	Visual Grounding	Code Available	1	5
A Fast and Accurate One-Stage Approach to Visual Grounding	Aug 18, 2019	Referring Expression, Referring Expression Comprehension	Code Available	1	5
Grounded Situation Recognition with Transformers	Nov 19, 2021	Decoder, Grounded Situation Recognition	Code Available	1	5
Mask Grounding for Referring Image Segmentation	Dec 19, 2023	cross-modal alignment, Image Segmentation	Code Available	1	5
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection	Dec 22, 2023	Attribute, object-detection	Code Available	1	5
Guessing State Tracking for Visual Dialogue	Feb 24, 2020	Visual Grounding	Code Available	1	5
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model	Jan 21, 2025	Hallucination, Image Captioning	Code Available	1	5
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding	Apr 26, 2021	Generalized Referring Expression Comprehension, Phrase Grounding	Code Available	1	5
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding	Oct 10, 2023	3D visual grounding, Visual Grounding	Code Available	1	5
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions	Feb 17, 2024	Visual Grounding	Code Available	1	5
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding	Jul 18, 2023	3D visual grounding, Object	Code Available	1	5
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	Nov 10, 2023	Diversity, Multi-Task Learning	Code Available	1	5
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding	Mar 5, 2024	3D visual grounding, Decision Making	Code Available	1	5
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition	Feb 15, 2024	Grounded Multimodal Named Entity Recognition, Multi-modal Named Entity Recognition	Code Available	1	5
Local-Global Context Aware Transformer for Language-Guided Video Segmentation	Mar 18, 2022	Referring Expression Segmentation, Referring Video Object Segmentation	Code Available	1	5
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling	Mar 21, 2024	Grounded language learning, Language Acquisition	Code Available	1	5
Visual Grounding Methods for VQA are Working for the Wrong Reasons!	Apr 12, 2020	Question Answering, Visual Grounding	Code Available	1	5
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory	Mar 19, 2024	Adversarial Text, Diversity	Code Available	1	5
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Aug 23, 2024	Language Modeling, Language Modelling	Code Available	1	5
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding	Nov 25, 2022	3D visual grounding, Knowledge Distillation	Code Available	1	5
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning	Apr 30, 2022	Attribute, Decoder	Code Available	1	5
Context Disentangling and Prototype Inheriting for Robust Visual Grounding	Dec 19, 2023	Visual Grounding	Code Available	1	5

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
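
The tables above report Accuracy (%) and IoU without saying how either is computed. The following is a minimal sketch of one common convention for box-based grounding, not the protocol of any listed paper: a prediction counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold. The corner box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the function names `box_iou` and `grounding_accuracy` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated in the tables
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    # First pair overlaps fully (IoU 1.0); second has IoU 1/7, below 0.5.
    print(grounding_accuracy(predictions, ground_truth))  # 50.0
```

Under this convention, accuracy is a thresholded count over queries, while a raw IoU score averages overlap directly, so the two metrics in the tables are not comparable with each other.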