
Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows this list):

  • What is the main focus of the query?
  • How should the model understand the image?
  • How should it localize the target object?
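
To make the task concrete, the sketch below runs a text query through an off-the-shelf open-vocabulary grounding model (Grounding DINO, as packaged in Hugging Face transformers) and prints the boxes it returns. The checkpoint name, image path, and query are illustrative assumptions, not part of this index, and post-processing argument names can vary across transformers versions.

    # Minimal visual-grounding inference sketch (illustrative, not tied to any
    # entry below). Requires: pip install torch transformers pillow
    import torch
    from PIL import Image
    from transformers import AutoProcessor, GroundingDinoForObjectDetection

    model_id = "IDEA-Research/grounding-dino-tiny"  # assumed public checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = GroundingDinoForObjectDetection.from_pretrained(model_id)

    image = Image.open("example.jpg").convert("RGB")  # any RGB image
    query = "the dog on the left."  # Grounding DINO expects lower-case, '.'-terminated text

    inputs = processor(images=image, text=query, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Turn raw logits/boxes into scored (x0, y0, x1, y1) boxes in pixel space,
    # using the processor's default score/text thresholds.
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        target_sizes=[image.size[::-1]],  # PIL size is (width, height); flip it
    )[0]
    for score, box in zip(results["scores"], results["boxes"]):
        print(f"score={score:.2f}  box={[round(v) for v in box.tolist()]}")

For pure VG benchmarks (one query, one target region), the highest-scoring box is typically taken as the prediction.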

Papers

Showing 26–50 of 571 papers

Title | Status | Hype
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | — | 0
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model | Code | 0
Two Causally Related Needles in a Video Haystack | — | 0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | — | 0
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | Code | 0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models | — | 0
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics | Code | 3
Training-Free Reasoning and Reflection in MLLMs | — | 0
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics | — | 0
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents | Code | 1
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding | — | 0
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Code | 2
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | — | 0
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning | Code | 3
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding | — | 0
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing | — | 0
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | Code | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | Code | 0
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding | — | 0
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding | Code | 0
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment | — | 0
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? | — | 0
Revisiting Data Auditing in Large Vision-Language Models | — | 0
Page 2 of 23

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | — | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | — | Unverified
7 | HYDRA | IoU | 61.7 | — | Unverified
8 | HYDRA | IoU | 61.1 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | — | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | — | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | — | Unverified
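
For context, VG leaderboards conventionally score a prediction as correct when its box overlaps the ground-truth box with IoU ≥ 0.5, and report Accuracy (%) as the fraction of correct queries (Acc@0.5); rows whose metric is IoU report the overlap value itself. The sketch below implements the Acc@0.5 protocol under those assumed conventions; the box format and the toy data are illustrative, not taken from the tables above.

    # Minimal sketch of the standard Acc@0.5 visual-grounding evaluation.
    # Boxes are (x0, y0, x1, y1) in pixels; the data below is made up.

    def iou(a, b):
        """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def grounding_accuracy(predictions, ground_truths, threshold=0.5):
        """Percentage of queries whose predicted box hits the target at IoU >= threshold."""
        hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
        return 100.0 * hits / len(ground_truths)

    # Toy example: two of three predictions clear the 0.5 IoU bar.
    preds = [(10, 10, 50, 50), (0, 0, 20, 20), (30, 30, 80, 80)]
    gts   = [(12, 12, 48, 52), (40, 40, 60, 60), (32, 28, 78, 82)]
    print(f"Acc@0.5 = {grounding_accuracy(preds, gts):.1f}%")  # -> 66.7%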