
Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges (a minimal inference sketch follows the list):

  • What is the main focus of the query?
  • How to understand the image?
  • How to locate the target object?
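In practice, many of the models in the benchmark tables below expose grounding as a prompted inference call. Here is a minimal sketch, assuming the public Hugging Face checkpoint microsoft/Florence-2-large-ft (the top entry in the results below) and its documented <CAPTION_TO_PHRASE_GROUNDING> task prompt; the image path and query string are placeholders, not taken from this page:

    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-large-ft"   # assumed checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    task = "<CAPTION_TO_PHRASE_GROUNDING>"       # Florence-2 grounding task token
    query = "the dog on the left"                # placeholder natural language query
    image = Image.open("example.jpg")            # placeholder image path

    inputs = processor(text=task + query, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor parses the raw token string into boxes and phrase labels.
    parsed = processor.post_process_generation(raw, task=task, image_size=image.size)
    print(parsed[task])  # {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}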

Papers

Showing 51–75 of 571 papers (page 3 of 23)

Title | Status | Hype
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Code | 2
GTA1: GUI Test-time Scaling Agent | Code | 2
A Simple Aerial Detection Baseline of Multimodal Language Models | Code | 2
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | Code | 2
NExT-Chat: An LMM for Chat, Detection and Segmentation | Code | 2
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Code | 2
Aligning and Prompting Everything All at Once for Universal Visual Perception | Code | 2
Reasoning to Attend: Try to Understand How <SEG> Token Works | Code | 2
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Code | 2
Interpreting Object-level Foundation Models via Visual Precision Search | Code | 2
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Code | 2
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Code | 2
ChatterBox: Multi-round Multimodal Referring and Grounding | Code | 2
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code | 2
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Code | 2
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Code | 2
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Code | 2
Referring Image Matting | Code | 2
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Code | 1
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding | Code | 1
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions | Code | 1
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | – | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | – | Unverified
7 | HYDRA | IoU | 61.7 | – | Unverified
8 | HYDRA | IoU | 61.1 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | – | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | – | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | – | Unverified
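On the metrics in the tables above: in referring-expression grounding, Accuracy (%) conventionally counts a prediction as correct when its box overlaps the ground truth with IoU ≥ 0.5, and IoU itself is the intersection-over-union of the two boxes. A minimal sketch of both computations (the [x1, y1, x2, y2] box format and the 0.5 threshold are the usual conventions, not stated on this page):

    def box_iou(a, b):
        # a, b: boxes as [x1, y1, x2, y2] with x2 > x1 and y2 > y1
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    def grounding_accuracy(pred_boxes, gt_boxes, thr=0.5):
        # Percentage of queries whose predicted box reaches IoU >= thr.
        hits = sum(box_iou(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes))
        return 100.0 * hits / len(gt_boxes)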