SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural-language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges (a minimal code sketch of the task follows the list):

  • What is the main focus of the query?
  • How should the model understand the image?
  • How should the model localize the referred object?
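
To make the task concrete, here is a minimal sketch of phrase grounding with an off-the-shelf zero-shot, text-conditioned detector (OWL-ViT via Hugging Face transformers). The checkpoint name, the image path, and the query are illustrative assumptions; this shows one possible way to run the task, not the method behind any model listed below.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Assumption: OWL-ViT as an example zero-shot grounder; any text-conditioned
# detector would do. The checkpoint and image path are placeholders.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
queries = [["a dog catching a frisbee"]]          # one query list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw model outputs back to pixel coordinates and filter by confidence.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# VG returns a single region: keep the highest-scoring box for the query.
best = results["scores"].argmax()
print("box (xmin, ymin, xmax, ymax):", results["boxes"][best].tolist())
print("score:", float(results["scores"][best]))
```

Taking the single best box stands in for "the most relevant region"; dialogue-style queries would require a model that additionally conditions on conversation history.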

Papers

Showing 526–550 of 571 papers

| Title | Status | Hype |
|---|---|---|
| LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation | Code | 0 |
| Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering | Code | 0 |
| Learning Two-Branch Neural Networks for Image-Text Matching Tasks | Code | 0 |
| SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding | Code | 0 |
| Learning to ground medical text in a 3D human atlas | Code | 0 |
| Smart Vision-Language Reasoners | Code | 0 |
| Learning semantic sentence representations from visually grounded language without lexical knowledge | Code | 0 |
| SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency | Code | 0 |
| Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation | Code | 0 |
| DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners | Code | 0 |
| Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | Code | 0 |
| Language with Vision: a Study on Grounded Word and Sentence Embeddings | Code | 0 |
| Adaptive Masking Enhances Visual Grounding | Code | 0 |
| Deconfounded Visual Grounding | Code | 0 |
| Visually Grounded VQA by Lattice-based Retrieval | Code | 0 |
| Language-Guided Diffusion Model for Visual Grounding | Code | 0 |
| Language Adaptive Weight Generation for Multi-task Visual Grounding | Code | 0 |
| Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat | Code | 0 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Code | 0 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0 |
| CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | Code | 0 |
| Investigating Compositional Challenges in Vision-Language Models for Visual Grounding | Code | 0 |
| CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models | Code | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0 |
| Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Code | 0 |
Page 22 of 23

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified |
| 7 | HYDRA | IoU | 61.7 | | Unverified |
| 8 | HYDRA | IoU | 61.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified |
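
The tables report two kinds of numbers: "Accuracy (%)", which in grounding benchmarks is typically Acc@0.5 (a prediction counts as correct when its IoU with the ground-truth box is at least 0.5), and IoU itself. Below is a self-contained sketch of both computations; the corner-format box encoding and the 0.5 threshold are assumptions, since the source does not specify them.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (xmin, ymin, xmax, ymax)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)

# Toy example: two queries, one hit and one miss at IoU >= 0.5.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 12, 48, 52), (30, 30, 60, 60)]
print(f"Acc@0.5 = {accuracy_at_iou(preds, gts):.1f}%")  # -> 50.0%
```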