SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the referred object be located?
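In code terms, the task contract is: one image and one text query in, one bounding box (plus a confidence score) out. The sketch below illustrates this with the off-the-shelf OWL-ViT zero-shot detector from Hugging Face transformers. OWL-ViT is an illustrative stand-in, not one of the models benchmarked on this page, and the example image URL and query string are arbitrary.

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Text-conditioned detector used here as a stand-in grounding model.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Arbitrary example inputs (a COCO image with two cats).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query = "the cat on the left"

inputs = processor(text=[[query]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates (x1, y1, x2, y2).
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.0
)[0]

# Grounding reduces to keeping the single highest-scoring box for the query.
best = results["scores"].argmax()
print(f"'{query}' -> box {results['boxes'][best].tolist()}, "
      f"score {results['scores'][best]:.3f}")
```

Dedicated grounding models differ mainly in how they fuse the query and image features before producing that box; the input/output shape stays the same.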

Papers

Showing papers 51–75 of 571 (page 3 of 23)

Title | Status | Hype
Visual Intention Grounding for Egocentric Assistants | - | 0
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts | - | 0
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding | Code | 0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding | - | 0
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Code | 9
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations | - | 0
Towards Visual Text Grounding of Multimodal Large Language Model | - | 0
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Code | 1
Multimodal Reference Visual Grounding | - | 0
Image Difference Grounding with Natural Language | - | 0
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Code | 0
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing | Code | 0
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning | - | 0
Efficient Adaptation For Remote Sensing Visual Grounding | - | 0
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning | Code | 1
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving | - | 0
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | - | 0
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes | - | 0
A Vision Centric Remote Sensing Benchmark | - | 0
Visual Position Prompt for MLLM based Visual Grounding | Code | 1
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation | Code | 0
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model | Code | 2
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding | Code | 2
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | - | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | - | Unverified
7 | HYDRA | IoU | 61.7 | - | Unverified
8 | HYDRA | IoU | 61.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | - | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | - | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | - | Unverified
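For reference, the Accuracy (%) figures above follow the standard referring-expression convention, Acc@0.5: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5; the IoU rows report that same overlap measure directly. A minimal sketch of the computation, assuming boxes in (x1, y1, x2, y2) format:

```python
def box_iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    # Acc@0.5: fraction of predictions whose IoU with the ground truth
    # reaches the threshold, reported as a percentage.
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# Example: one hit (IoU 0.81) and one miss (IoU 0.0) -> 50.0
print(grounding_accuracy([(0, 0, 10, 10), (0, 0, 2, 2)],
                         [(1, 1, 10, 10), (5, 5, 9, 9)]))
```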