SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the referred object be localized?
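
In practice, the task reduces to a simple interface: an image and a text query go in, and a scored bounding box comes out. Below is a minimal inference sketch using the Hugging Face transformers zero-shot object detection pipeline, with OWL-ViT as a stand-in grounding model; the model choice, image path, and query text are illustrative assumptions, not the method of any paper listed on this page.

```python
# Minimal query-to-box sketch. OWL-ViT is an open-vocabulary detector
# rather than a dedicated VG model, so this only illustrates the
# interface: image + text query in, scored bounding boxes out.
from transformers import pipeline

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlvit-base-patch32",  # stand-in model, chosen for illustration
)

results = detector(
    "street_scene.jpg",  # hypothetical local image path
    candidate_labels=["the red car parked on the left"],  # the VG query
)

# Each result carries a confidence score and a box in pixel coordinates.
for r in results:
    box = r["box"]
    print(f'{r["label"]}: score={r["score"]:.2f}, '
          f'box=({box["xmin"]}, {box["ymin"]}, {box["xmax"]}, {box["ymax"]})')
```

Dedicated VG models differ mainly in how they fuse the query with image features, but most expose essentially this query-to-box contract.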

Papers

Showing 201–225 of 571 papers

| Title | Status | Hype |
|-------|--------|------|
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1 |
| Spatially Aware Multimodal Transformers for TextVQA | Code | 1 |
| UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Code | 1 |
| ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding | Code | 0 |
| SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention | Code | 0 |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Code | 0 |
| RoViST: Learning Robust Metrics for Visual Storytelling | Code | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Code | 0 |
| Revisiting Visual Question Answering Baselines | Code | 0 |
| Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code | 0 |
| Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering | Code | 0 |
| Beyond Human Perception: Understanding Multi-Object World from Monocular View | Code | 0 |
| ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding | Code | 0 |
| Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Code | 0 |
| DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners | Code | 0 |
| A Better Loss for Visual-Textual Grounding | Code | 0 |
| Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models | Code | 0 |
| Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach | Code | 0 |
| Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding | Code | 0 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Code | 0 |
| HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks | Code | 0 |
| Deconfounded Visual Grounding | Code | 0 |
| Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition | Code | 0 |
Page 9 of 23

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | – | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | – | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | – | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | – | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | – | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | – | Unverified |
| 7 | HYDRA | IoU | 61.7 | – | Unverified |
| 8 | HYDRA | IoU | 61.1 | – | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | – | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | – | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | – | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | – | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | – | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | – | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | – | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | – | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | – | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | – | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | – | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | – | Unverified |
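
For reading the tables above: IoU (Intersection over Union) measures the overlap between a predicted box and the ground-truth box, and "Accuracy (%)" on VG benchmarks conventionally means the fraction of queries whose predicted box exceeds an IoU threshold, usually 0.5. The 0.5 threshold is an assumption here, since this page does not state it. A minimal sketch of both metrics:

```python
# Minimal sketch of the two metrics in the tables above. Boxes are
# (xmin, ymin, xmax, ymax); the 0.5 accuracy threshold is a common VG
# convention, assumed here rather than taken from this page.

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Percentage of queries whose predicted box overlaps the ground
    truth above the threshold (Acc@0.5 under the assumption above)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)

# Toy usage: two queries, one correct localization.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 12, 52, 52), (60, 60, 90, 90)]
print(accuracy_at_iou(preds, gts))  # 50.0
```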