Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 26–50 of 571 papers

Title	Date	Tasks	Status	Hype
GTA1: GUI Test-time Scaling Agent	Jul 8, 2025	Reinforcement Learning (RL)Task Planning	CodeCode Available	2
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning	Jul 8, 2025	MMEReinforcement Learning (RL)	CodeCode Available	2
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World	Jun 30, 2025	Caption GenerationObject	CodeCode Available	2
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition	May 21, 2025	Earth ObservationObject	CodeCode Available	2
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model	Mar 17, 2025	Image SegmentationSegmentation	CodeCode Available	2
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding	Mar 17, 2025	Domain GeneralizationMultimodal Reasoning	CodeCode Available	2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories	Mar 11, 2025	Decision MakingInteractive Segmentation	CodeCode Available	2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories	Mar 11, 2025	Decision MakingInteractive Segmentation	CodeCode Available	2
A Simple Aerial Detection Baseline of Multimodal Language Models	Jan 16, 2025	object-detectionObject Detection	CodeCode Available	2
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics	Jan 8, 2025	MathMathematical Reasoning	CodeCode Available	2
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention	Jan 1, 2025	HallucinationResponse Generation	CodeCode Available	2
Reasoning to Attend: Try to Understand How <SEG> Token Works	Dec 23, 2024	Semantic SimilaritySemantic Textual Similarity	CodeCode Available	2
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action	Dec 7, 2024	Depth EstimationMathematical Reasoning	CodeCode Available	2
Interpreting Object-level Foundation Models via Visual Precision Search	Nov 25, 2024	Explainable Artificial Intelligence (XAI)Object	CodeCode Available	2
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding	Nov 16, 2024	Instruction FollowingLanguage Modeling	CodeCode Available	2
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	Oct 17, 2024	3D geometry3D visual grounding	CodeCode Available	2
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI	Oct 15, 2024	Question AnsweringVideo Question Answering	CodeCode Available	2
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion	Sep 26, 2024	DescriptiveGeneralized Referring Expression Comprehension	CodeCode Available	2
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding	Sep 5, 2024	Question AnsweringScene Understanding	CodeCode Available	2
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation	Aug 9, 2024	Image to textObject	CodeCode Available	2
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation	Jul 25, 2024	3D visual groundingImage Segmentation	CodeCode Available	2
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding	Jul 3, 2024	object-detectionObject Detection	CodeCode Available	2
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention	Jun 18, 2024	ObjectResponse Generation	CodeCode Available	2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding	Jun 18, 2024	Image CaptioningQuestion Answering	CodeCode Available	2
Towards Vision-Language Geo-Foundation Model: A Survey	Jun 13, 2024	Earth ObservationImage Captioning	CodeCode Available	2

Show:10 25 50

← PrevPage 2 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified