Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 301–350 of 571 papers

Title	Date	Tasks	Status
Improved Visual Grounding through Self-Consistent Explanations	Dec 7, 2023	Language ModellingLarge Language Model	—Unverified
Improving Visually Grounded Sentence Representations with Self-Attention	Dec 2, 2017	SentenceVisual Grounding	—Unverified
Individuation in Neural Models with and without Visual Grounding	Sep 27, 2024	Visual Grounding	—Unverified
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection3D visual grounding	—Unverified
Interactive Reinforcement Learning for Object Grounding via Self-Talking	Dec 2, 2017	Objectreinforcement-learning	—Unverified
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction	Jun 11, 2018	Question GenerationQuestion-Generation	—Unverified
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining	Aug 1, 2018	Question AnsweringVisual Grounding	—Unverified
Interpretable Visual Question Answering via Reasoning Supervision	Sep 7, 2023	Common Sense ReasoningQuestion Answering	—Unverified
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter	Aug 25, 2021	BlockingObject	—Unverified
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs	Jun 17, 2025	3D visual groundingContrastive Learning	—Unverified
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Oct 21, 2024	3D visual groundingObject	—Unverified
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms	Jul 1, 2020	DiagnosticObject	—Unverified
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object DetectionAutonomous Driving	—Unverified
Language learning using Speech to Image retrieval	Sep 9, 2019	Grounded language learningImage Retrieval	—Unverified
LanguageRefer: Spatial-Language Model for 3D Visual Grounding	Jul 7, 2021	3D visual groundingLanguage Modeling	—Unverified
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering	Jan 29, 2024	Language ModelingLanguage Modelling	—Unverified
Learning from Synthetic Data for Visual Grounding	Mar 20, 2024	Language ModellingLarge Language Model	—Unverified
Visually Consistent Hierarchical Image Classification	Jun 17, 2024	Classificationimage-classification	—Unverified
Learning Language Structures through Grounding	Jun 14, 2024	Automatic Speech RecognitionDependency Parsing	—Unverified
Learning to Compose and Reason with Language Tree Structures for Visual Grounding	Jun 5, 2019	Visual GroundingVisual Reasoning	—Unverified
Learning to Ground VLMs without Forgetting	Oct 14, 2024	DecoderLanguage Modelling	—Unverified
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision	Mar 17, 2018	Visual Grounding	—Unverified
Learning Visual Grounding from Generative Vision and Language Model	Jul 18, 2024	AttributeLanguage Modeling	—Unverified
Learning with Difference Attention for Visually Grounded Self-supervised Representations	Jun 26, 2023	Self-Supervised LearningVisual Grounding	—Unverified
Less is More: Generating Grounded Navigation Instructions from Landmarks	Nov 25, 2021	DecoderInstruction Following	—Unverified
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring	Feb 16, 2025	Instance SegmentationLanguage Modeling	—Unverified
Leveraging Past References for Robust Language Grounding	Nov 1, 2019	ObjectReferring Expression	—Unverified
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers	Nov 7, 2024	3D visual groundingAutonomous Driving	—Unverified
Lightweight In-Context Tuning for Multimodal Unified Models	Oct 8, 2023	Image CaptioningIn-Context Learning	—Unverified
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language ModelingLanguage Modelling	—Unverified
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding	May 27, 2024	Visual Grounding	—Unverified
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation	Jan 1, 2024	Image SegmentationSemantic Segmentation	—Unverified
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation	Aug 29, 2024	Instruction FollowingMedical Report Generation	—Unverified
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Oct 9, 2022	Image-text Retrievalmultimodal interaction	—Unverified
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image AnalysisPhrase Grounding	—Unverified
MedRG: Medical Report Grounding with Multi-modal Large Language Model	Apr 10, 2024	DecoderLanguage Modeling	—Unverified
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding	May 17, 2025	Visual GroundingVisual Question Answering (VQA)	—Unverified
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	May 27, 2025	HallucinationVisual Grounding	—Unverified
MMR: Evaluating Reading Ability of Large Multimodal Models	Aug 26, 2024	Font RecognitionMMR total	—Unverified
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding	Nov 27, 2022	named-entity-recognitionNamed Entity Recognition	—Unverified
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction FollowingText Generation	—Unverified
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	DiagnosticHallucination	—Unverified
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarkingcounterfactual	—Unverified
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	—Unverified
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation	May 29, 2025	Question AnsweringRAG	—Unverified
Multi-Granularity Modularized Network for Abstract Visual Reasoning	Jul 9, 2020	Visual GroundingVisual Reasoning	—Unverified
Multimodal Reference Visual Grounding	Apr 2, 2025	Few-Shot Object DetectionVisual Grounding	—Unverified
Multimodal Unified Attention Networks for Vision-and-Language Interactions	Aug 12, 2019	Question AnsweringVisual Grounding	—Unverified
Multi-task Learning of Hierarchical Vision-Language Representation	Dec 3, 2018	Multi-Task LearningQuestion Answering	—Unverified
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar	Aug 30, 2024	Autonomous DrivingVisual Grounding	—Unverified

Show:10 25 50

← PrevPage 7 of 12Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified