Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?
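
To make the task concrete, here is a minimal sketch and not any listed paper's method. It assumes the query and each candidate region have already been encoded as vectors by some text and image encoder, which is outside the sketch. It returns the box of the region most similar to the query. The names `Region`, `cosine`, and `ground` are hypothetical and chosen for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Region:
    """A candidate image region: its box and an (assumed) feature vector."""
    box: Box
    feature: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground(query_feature: Sequence[float], regions: List[Region]) -> Box:
    """Return the box of the region whose feature best matches the query."""
    if not regions:
        raise ValueError("at least one candidate region is required")
    best = max(regions, key=lambda r: cosine(query_feature, r.feature))
    return best.box
```

In this framing, the first two challenges correspond to producing good query and region features, and the third to the selection step.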

Papers

Showing 501–550 of 571 papers

Title	Date	Tasks	Status
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding	Mar 23, 2023	3D visual grounding, Visual Grounding	Code Available
You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding	Feb 12, 2019	object-detection, Object Detection	Code Available
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding	May 7, 2025	3D visual grounding, Graph Attention	Code Available
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach	Oct 3, 2022	Referring Expression, Robot Manipulation	Code Available
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention	Mar 13, 2024	3D visual grounding, cross-modal alignment	Code Available
Finding beans in burgers: Deep semantic-visual embedding with localization	Apr 5, 2018	Cross-Modal Retrieval, Image Captioning	Code Available
Few-Shot Multimodal Explanation for Visual Question Answering	Oct 28, 2024	Explainable artificial intelligence, Explainable Artificial Intelligence (XAI)	Code Available
Multi-Attribute Interactions Matter for 3D Visual Grounding	Jan 1, 2024	3D visual grounding, Attribute	Code Available
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model	May 26, 2025	Diagnostic, Reinforcement Learning (RL)	Code Available
Composing Pick-and-Place Tasks By Grounding Language	Feb 16, 2021	Natural Language Visual Grounding, Robotic Grasping	Code Available
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model	Jul 7, 2024	Segmentation, Sentence	Code Available
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering	Sep 30, 2024	Optical Character Recognition (OCR), Question Answering	Code Available
Modularized Textual Grounding for Counterfactual Resilience	Apr 7, 2019	Attribute, counterfactual	Code Available
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment	Dec 5, 2023	Explanation Generation, Visual Grounding	Code Available
Measuring Faithful and Plausible Visual Grounding in VQA	May 24, 2023	Question Answering, Visual Grounding	Code Available
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs	Oct 16, 2024	Visual Grounding	Code Available
Self-view Grounding Given a Narrated 360° Video	Nov 23, 2017	Sentence, Visual Grounding	Code Available
Dual Attention Networks for Visual Reference Resolution in Visual Dialog	Feb 25, 2019	AI Agent, Question Answering	Code Available
Semantic query-by-example speech search using visual grounding	Apr 15, 2019	Retrieval, Semantic Retrieval	Code Available
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images	Jun 26, 2025	document understanding, Optical Character Recognition (OCR)	Code Available
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing	Mar 31, 2025	Object, object-detection	Code Available
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction	Dec 5, 2024	Relation Extraction, Visual Grounding	Code Available
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI	May 9, 2025	4k, Domain Generalization	Code Available
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling	Sep 9, 2024	Language Modeling, Language Modelling	Code Available
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations	Jun 4, 2023	Visual Grounding, Word Embeddings	Code Available
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation	Mar 18, 2025	Decoder, Object	Code Available
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering	Sep 13, 2021	Data Augmentation, Question Answering	Code Available
Learning Two-Branch Neural Networks for Image-Text Matching Tasks	Apr 11, 2017	Image-text matching, Retrieval	Code Available
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding	Jul 27, 2022	Visual Grounding	Code Available
Learning to ground medical text in a 3D human atlas	Nov 1, 2020	Phrase Grounding, Visual Grounding	Code Available
Smart Vision-Language Reasoners	Jul 5, 2024	Math, Mathematical Reasoning	Code Available
Learning semantic sentence representations from visually grounded language without lexical knowledge	Mar 27, 2019	Grounded language learning, Learning Semantic Representations	Code Available
SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency	Oct 20, 2020	Question Answering, Visual Grounding	Code Available
Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation	May 24, 2021	Referring Expression, Referring Expression Comprehension	Code Available
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners	Sep 7, 2023	Diagnostic, Visual Grounding	Code Available
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models	May 16, 2024	Adversarial Attack, Adversarial Robustness	Code Available
Language with Vision: a Study on Grounded Word and Sentence Embeddings	Jun 17, 2022	Sentence, Sentence Embeddings	Code Available
Adaptive Masking Enhances Visual Grounding	Oct 4, 2024	Few-Shot Learning, Visual Grounding	Code Available
Deconfounded Visual Grounding	Dec 31, 2021	Referring Expression, Visual Grounding	Code Available
Visually Grounded VQA by Lattice-based Retrieval	Nov 15, 2022	Information Retrieval, Question Answering	Code Available
Language-Guided Diffusion Model for Visual Grounding	Aug 18, 2023	cross-modal alignment, Denoising	Code Available
Language Adaptive Weight Generation for Multi-task Visual Grounding	Jun 6, 2023	Referring Expression, Referring Expression Comprehension	Code Available
Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat	Sep 10, 2018	Multi-Task Learning, Reinforcement Learning	Code Available
Collecting Visually-Grounded Dialogue with A Game Of Sorts	Sep 10, 2023	Coreference Resolution, Image Retrieval	Code Available
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	Benchmarking, Visual Grounding	Code Available
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays	May 23, 2025	Diagnostic, Question Answering	Code Available
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding	Jan 1, 2024	Attribute, Relation	Code Available
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models	Dec 22, 2024	Language Modeling, Language Modelling	Code Available
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	Benchmarking, Ethics	Code Available
Answer Questions with Right Image Regions: A Visual Attention Regularization Approach	Feb 3, 2021	Question Answering, Visual Grounding	Code Available

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
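
The tables above report Accuracy (%) and IoU without stating how either is computed. A common convention on referring expression benchmarks such as RefCOCO is to count a prediction as correct when the intersection over union (IoU) between the predicted and ground-truth boxes is at least 0.5. The sketch below assumes that convention and axis-aligned boxes in `(x1, y1, x2, y2)` form. The function names and the default threshold are illustrative assumptions, not taken from this page or from any listed model's evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return 100.0 * hits / len(preds)
```

For example, a prediction of `(0, 0, 10, 10)` against a ground truth of `(5, 0, 15, 10)` overlaps on 50 of 150 units of combined area, so its IoU is 1/3 and it would count as a miss at the 0.5 threshold.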