Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 251–275 of 571 papers

Title	Date	Tasks	Status	Score
Revisiting Visual Question Answering Baselines	Jun 27, 2016	Binary ClassificationMultiple-choice	CodeCode Available	5
GROOViST: A Metric for Grounding Objects in Visual Storytelling	Oct 26, 2023	Visual GroundingVisual Storytelling	CodeCode Available	5
AttnGrounder: Talking to Cars with Attention	Sep 11, 2020	Referring Expression ComprehensionVisual Grounding	CodeCode Available	5
Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding	Jan 1, 2025	3D visual groundingData Augmentation	CodeCode Available	5
Cost-Effective Language Driven Image Editing with LX-DRIM	Oct 1, 2022	Visual Grounding	CodeCode Available	5
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding	Apr 13, 2025	3D visual groundingData Augmentation	CodeCode Available	5
Cosine meets Softmax: A tough-to-beat baseline for visual grounding	Sep 13, 2020	Autonomous DrivingMetric Learning	CodeCode Available	5
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer	May 1, 2022	Text GenerationVisual Grounding	CodeCode Available	5
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models	Dec 11, 2024	Question AnsweringVisual Grounding	CodeCode Available	5
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding	Oct 31, 2024	ObjectPosition	CodeCode Available	5
Context-Infused Visual Grounding for Art	Oct 16, 2024	object-detectionObject Detection	CodeCode Available	5
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network	Oct 25, 2023	Visual Grounding	CodeCode Available	5
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training	Dec 3, 2023	object-detectionObject Detection	CodeCode Available	5
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models	May 16, 2024	Adversarial AttackAdversarial Robustness	CodeCode Available	5
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	BenchmarkingEthics	CodeCode Available	5
Flexible Visual Grounding	May 1, 2022	ArticlesVisual Grounding	CodeCode Available	5
FiVL: A Framework for Improved Vision-Language Alignment	Dec 19, 2024	Answer GenerationMultimodal Reasoning	CodeCode Available	5
Connecting Vision and Language with Localized Narratives	Dec 6, 2019	FormImage Captioning	CodeCode Available	5
Composing Pick-and-Place Tasks By Grounding Language	Feb 16, 2021	Natural Language Visual GroundingRobotic Grasping	CodeCode Available	5
Finding beans in burgers: Deep semantic-visual embedding with localization	Apr 5, 2018	Cross-Modal RetrievalImage Captioning	CodeCode Available	5
Neural Twins Talk	Sep 26, 2020	Image CaptioningSentence	CodeCode Available	5
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning	Oct 17, 2023	SegmentationVisual Grounding	CodeCode Available	5
Collecting Visually-Grounded Dialogue with A Game Of Sorts	Sep 10, 2023	Coreference ResolutionImage Retrieval	CodeCode Available	5
Few-Shot Multimodal Explanation for Visual Question Answering	Oct 28, 2024	Explainable artificial intelligenceExplainable Artificial Intelligence (XAI)	CodeCode Available	5
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding	May 7, 2025	3D visual groundingGraph Attention	CodeCode Available	5

Show:10 25 50

← PrevPage 11 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified