Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 301–350 of 571 papers

Title	Date	Tasks	Status	Score
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering	Sep 30, 2024	Optical Character Recognition (OCR), Question Answering	Code Available	5
You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding	Feb 12, 2019	object-detection, Object Detection	Code Available	5
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction Following, Text Generation	Unverified	0
Visual Intention Grounding for Egocentric Assistants	Apr 18, 2025	Object, Visual Grounding	Unverified	0
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation	Jan 28, 2017	Response Generation, Retrieval	Unverified	0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	Diagnostic, Hallucination	Unverified	0
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarking, counterfactual	Unverified	0
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified	0
Image Difference Grounding with Natural Language	Apr 2, 2025	Visual Grounding	Unverified	0
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search	Jul 1, 2018	General Classification, Image Retrieval	Unverified	0
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation	May 29, 2025	Question Answering, RAG	Unverified	0
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task	Jun 4, 2024	Head Pose Estimation, Language Modelling	Unverified	0
Visually grounded cross-lingual keyword spotting in speech	Jun 13, 2018	Keyword Spotting, Visual Grounding	Unverified	0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model	Jun 1, 2024	Action Recognition, Activity Recognition	Unverified	0
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation	Jun 26, 2025	counterfactual, Counterfactual Reasoning	Unverified	0
Multi-Granularity Modularized Network for Abstract Visual Reasoning	Jul 9, 2020	Visual Grounding, Visual Reasoning	Unverified	0
Visually Grounded Neural Syntax Acquisition	Jun 7, 2019	Visual Grounding	Unverified	0
Guiding Visual Question Answering with Attention Priors	May 25, 2022	Question Answering, Visual Grounding	Unverified	0
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents	Jun 3, 2025	Visual Grounding	Unverified	0
Multimodal Reference Visual Grounding	Apr 2, 2025	Few-Shot Object Detection, Visual Grounding	Unverified	0
Multimodal Unified Attention Networks for Vision-and-Language Interactions	Aug 12, 2019	Question Answering, Visual Grounding	Unverified	0
Multi-task Learning of Hierarchical Vision-Language Representation	Dec 3, 2018	Multi-Task Learning, Question Answering	Unverified	0
GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance	Oct 9, 2024	Visual Grounding	Unverified	0
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding	Jun 26, 2025	3D visual grounding, Large Language Model	Unverified	0
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar	Aug 30, 2024	Autonomous Driving, Visual Grounding	Unverified	0
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners	Apr 30, 2024	3D visual grounding, Visual Grounding	Unverified	0
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image Captioning, Object Detection	Unverified	0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	Oct 21, 2024	Instruction Following, object-detection	Unverified	0
Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics	Oct 10, 2024	Visual Grounding	Unverified	0
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations	Feb 2, 2024	Contrastive Learning, Object	Unverified	0
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding	Mar 8, 2025	Language Modeling, Language Modelling	Unverified	0
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding	Mar 19, 2020	Object, Referring Expression Comprehension	Unverified	0
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding	Jul 9, 2025	3D visual grounding, Autonomous Navigation	Unverified	0
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment	Mar 27, 2019	Image Retrieval, Phrase Grounding	Unverified	0
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving	Mar 28, 2025	3D visual grounding, Autonomous Driving	Unverified	0
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection	Sep 18, 2023	3D Object Detection, 3D Open-Vocabulary Object Detection	Unverified	0
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image Captioning, Language Modeling	Unverified	0
OG: Equip vision occupancy with instance segmentation and visual grounding	Jul 12, 2023	Instance Segmentation, Segmentation	Unverified	0
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web	Feb 27, 2024	Language Modeling, Language Modelling	Unverified	0
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding	Jan 1, 2024	Scene Understanding, Visual Grounding	Unverified	0
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning	Jun 22, 2025	Answer Generation, Decision Making	Unverified	0
On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval	Apr 24, 2019	Retrieval, Visual Grounding	Unverified	0
On the Role of Visual Grounding in VQA	Jun 26, 2024	Visual Grounding, Visual Question Answering (VQA)	Unverified	0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene Understanding, Semantic Segmentation	Unverified	0
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models	Jul 18, 2024	3D Semantic Segmentation, Semantic Segmentation	Unverified	0
OptiBox: Breaking the Limits of Proposals for Visual Grounding	Nov 29, 2019	Image Captioning, Visual Grounding	Unverified	0
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image Generation, Image-text Retrieval	Unverified	0
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization	Oct 8, 2018	Question Answering, Visual Grounding	Unverified	0
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding	Jan 1, 2024	3D visual grounding, Visual Grounding	Unverified	0
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding	Dec 1, 2024	Visual Grounding	Unverified	0


Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
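
The tables above report Accuracy (%) and IoU without defining them. As a rough illustration only, the sketch below assumes the convention common in referring-expression benchmarks: a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold, typically 0.5 (some evaluations require strictly greater). The function names, the corner-format boxes, and the threshold default are assumptions made for this example, not details taken from the listed papers.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed convention; adjust to the benchmark's rule
) -> float:
    """Percentage of predictions whose IoU with the target reaches the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy run the first prediction matches its target exactly and the second overlaps its target by only one seventh, so one of the two predictions counts as correct at the 0.5 threshold.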