Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 201–250 of 571 papers

Title	Date	Tasks	Status	Hype
Learning Cross-modal Context Graph for Visual Grounding	Feb 13, 2020	Graph Matching, Graph Neural Network	Code Available	1
Learning Cross-modal Context Graph for Visual Grounding	Nov 20, 2019	Graph Matching, Graph Neural Network	Code Available	1
A Fast and Accurate One-Stage Approach to Visual Grounding	Aug 18, 2019	Referring Expression, Referring Expression Comprehension	Code Available	1
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks	Aug 6, 2019	Image Retrieval, Question Answering	Code Available	1
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition	Jul 15, 2025	3D visual grounding, Visual Grounding	Unverified	0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual grounding, Autonomous Navigation	Unverified	0
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation	Jul 9, 2025	Backdoor Attack, Visual Grounding	Unverified	0
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding	Jun 27, 2025	3D visual grounding, Natural Language Queries	Unverified	0
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images	Jun 26, 2025	document understanding, Optical Character Recognition (OCR)	Code Available	0
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation	Jun 26, 2025	counterfactual, Counterfactual Reasoning	Unverified	0
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding	Jun 26, 2025	3D visual grounding, Large Language Model	Unverified	0
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning	Jun 22, 2025	Answer Generation, Decision Making	Unverified	0
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs	Jun 17, 2025	3D visual grounding, Contrastive Learning	Unverified	0
Unified Representation Space for 3D Visual Grounding	Jun 17, 2025	3D visual grounding, Contrastive Learning	Unverified	0
Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation	Jun 12, 2025	Image Segmentation, Segmentation	Unverified	0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments	Jun 9, 2025	Benchmarking, Navigate	Unverified	0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Jun 5, 2025	cross-modal alignment, Dense Captioning	Unverified	0
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning	Jun 5, 2025	Math, Visual Grounding	Unverified	0
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes	Jun 5, 2025	3D visual grounding, Object	Unverified	0
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought	Jun 4, 2025	Multimodal Reasoning, Reasoning Segmentation	Unverified	0
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents	Jun 3, 2025	Visual Grounding	Unverified	0
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction Following, Text Generation	Unverified	0
D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding	May 30, 2025	Diversity, Pseudo Label	Unverified	0
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation	May 29, 2025	Question Answering, RAG	Unverified	0
Zero-Shot 3D Visual Grounding from Vision-Language Models	May 28, 2025	3D visual grounding, Visual Grounding	Unverified	0
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	May 27, 2025	Hallucination, Visual Grounding	Unverified	0
Two Causally Related Needles in a Video Haystack	May 26, 2025	Video Understanding, Visual Grounding	Unverified	0
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model	May 26, 2025	Diagnostic, Reinforcement Learning (RL)	Code Available	0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation	May 24, 2025	Mathematical Reasoning, Multimodal Reasoning	Unverified	0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	Diagnostic, Hallucination	Unverified	0
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays	May 23, 2025	Diagnostic, Question Answering	Code Available	0
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics	May 22, 2025	Image Captioning, text similarity	Unverified	0
Training-Free Reasoning and Reflection in MLLMs	May 22, 2025	Decoder, Multimodal Reasoning	Unverified	0
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding	May 21, 2025	Visual Grounding	Unverified	0
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning	May 20, 2025	Large Language Model, Multimodal Large Language Model	Unverified	0
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing	May 17, 2025	Language Modeling, Language Modelling	Unverified	0
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings	May 17, 2025	Image to text, Information Retrieval	Code Available	0
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding	May 17, 2025	Visual Grounding, Visual Question Answering (VQA)	Unverified	0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	Benchmarking, Ethics	Code Available	0
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI	May 9, 2025	4k, Domain Generalization	Code Available	0
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding	May 8, 2025	3D visual grounding, cross-modal alignment	Unverified	0
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding	May 7, 2025	3D visual grounding, Graph Attention	Code Available	0
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment	May 3, 2025	Sentence, Visual Grounding	Unverified	0
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?	Apr 27, 2025	Visual Grounding, Visual Storytelling	Unverified	0
Revisiting Data Auditing in Large Vision-Language Models	Apr 25, 2025	Visual Grounding	Unverified	0
Visual Intention Grounding for Egocentric Assistants	Apr 18, 2025	Object, Visual Grounding	Unverified	0
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts	Apr 14, 2025	Benchmarking, Object	Unverified	0
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding	Apr 13, 2025	3D visual grounding, Data Augmentation	Code Available	0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding	Apr 11, 2025	3D visual grounding, Scene Understanding	Unverified	0
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations	Apr 10, 2025	Spatial Reasoning, Visual Grounding	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
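
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how they are commonly computed for box-level grounding. It assumes boxes in `(x1, y1, x2, y2)` corner format and counts a prediction as correct when its IoU with the ground-truth box is at least 0.5, a convention widely used on RefCOCO-style benchmarks. This page does not state the threshold or box format behind the listed numbers, so both are assumptions, and the function names `box_iou` and `grounding_accuracy` are illustrative only.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In the usage example, the first prediction matches its ground truth exactly, while the second overlaps it with an IoU of 1/7, so only one of the two counts as correct at the assumed 0.5 threshold.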