Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 201–250 of 571 papers

Title	Date	Tasks	Status	Hype
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning	Feb 1, 2025	Referring Expression, Visual Grounding	Code Available	1
Panoptic Narrative Grounding	Sep 10, 2021	Natural Language Visual Grounding, Panoptic Segmentation	Code Available	1
Visual Grounding in Video for Unsupervised Word Translation	Mar 11, 2020	Translation, Visual Grounding	Code Available	1
Visual Grounding of Learned Physical Models	Apr 28, 2020	Visual Grounding	Code Available	1
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models	Jan 6, 2025	Hallucination, Visual Grounding	Unverified	0
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding	Sep 28, 2022	Decoder, Visual Grounding	Unverified	0
Dynamic Inference With Grounding Based Vision and Language Models	Jan 1, 2023	Language Modelling, Referring Expression	Unverified	0
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation	Dec 29, 2023	Visual Grounding	Unverified	0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual grounding, Autonomous Navigation	Unverified	0
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified	0
LanguageRefer: Spatial-Language Model for 3D Visual Grounding	Jul 7, 2021	3D visual grounding, Language Modeling	Unverified	0
ACTRESS: Active Retraining for Semi-supervised Visual Grounding	Jul 3, 2024	Binary Classification, Visual Grounding	Unverified	0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	Diagnostic, Hallucination	Unverified	0
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding	Nov 27, 2022	named-entity-recognition, Named Entity Recognition	Unverified	0
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models	Apr 26, 2024	Game Design, Image Generation	Unverified	0
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction Following, Text Generation	Unverified	0
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarking, counterfactual	Unverified	0
Data-Efficient 3D Visual Grounding via Order-Aware Referring	Mar 25, 2024	3D visual grounding, Object	Unverified	0
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Oct 21, 2024	3D visual grounding, Object	Unverified	0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation	May 24, 2025	Mathematical Reasoning, Multimodal Reasoning	Unverified	0
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs	Jun 17, 2025	3D visual grounding, Contrastive Learning	Unverified	0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Jun 5, 2025	cross-modal alignment, Dense Captioning	Unverified	0
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding	Mar 25, 2025	Attribute, Object	Unverified	0
MMR: Evaluating Reading Ability of Large Multimodal Models	Aug 26, 2024	Font Recognition, MMR total	Unverified	0
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis	Oct 31, 2023	Descriptive, Medical Image Analysis	Unverified	0
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning, 3D visual grounding	Unverified	0
Interpretable Visual Question Answering via Reasoning Supervision	Sep 7, 2023	Common Sense Reasoning, Question Answering	Unverified	0
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	May 27, 2025	Hallucination, Visual Grounding	Unverified	0
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining	Aug 1, 2018	Question Answering, Visual Grounding	Unverified	0
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction	Jun 11, 2018	Question Generation, Question-Generation	Unverified	0
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement	Oct 1, 2022	Graph Neural Network, Object	Unverified	0
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter	Aug 25, 2021	Blocking, Object	Unverified	0
Interactive Reinforcement Learning for Object Grounding via Self-Talking	Dec 2, 2017	Object, reinforcement-learning	Unverified	0
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection, 3D visual grounding	Unverified	0
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe	Sep 4, 2019	Disentanglement, Visual Grounding	Unverified	0
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment	Mar 27, 2019	Image Retrieval, Phrase Grounding	Unverified	0
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe	Jul 17, 2019	Disentanglement, Visual Grounding	Unverified	0
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms	Jul 1, 2020	Diagnostic, Object	Unverified	0
Benchmarking Diverse-Modal Entity Linking with Generative Models	May 27, 2023	Benchmarking, Decoder	Unverified	0
AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training	Jun 19, 2021	Visual Grounding	Unverified	0
Individuation in Neural Models with and without Visual Grounding	Sep 27, 2024	Visual Grounding	Unverified	0
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object Detection, Autonomous Driving	Unverified	0
Detecting Concrete Visual Tokens for Multimodal Machine Translation	Mar 5, 2024	Machine Translation, Multimodal Machine Translation	Unverified	0
Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG	Nov 1, 2018	Text Generation, Visual Grounding	Unverified	0
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding	May 17, 2025	Visual Grounding, Visual Question Answering (VQA)	Unverified	0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding	Apr 11, 2025	3D visual grounding, Scene Understanding	Unverified	0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering	Jan 29, 2024	Language Modeling, Language Modelling	Unverified	0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding	Jun 13, 2024	3D visual grounding, Attribute	Unverified	0
Improving Visually Grounded Sentence Representations with Self-Attention	Dec 2, 2017	Sentence, Visual Grounding	Unverified	0
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding	May 8, 2025	3D visual grounding, cross-modal alignment	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
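
The tables above report Accuracy (%) and IoU without defining them. As a rough sketch, the snippet below computes the intersection-over-union (IoU) of two boxes and an accuracy score that counts a prediction as correct when its IoU with the ground-truth box reaches a threshold. The 0.5 threshold is a common convention for referring-expression benchmarks and is an assumption here, as are the (x1, y1, x2, y2) box format and the function names box_iou and grounding_accuracy; the page itself does not specify how the listed numbers were computed.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the ground truth is >= threshold.

    The default threshold of 0.5 is an assumed convention, not taken from the page.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy input the first prediction matches its target exactly and the second overlaps its target only slightly, so one of the two predictions clears the threshold.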