Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How can the referred object be located?

Papers


Showing 351–400 of 571 papers

Title	Date	Tasks	Status
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners	Apr 30, 2024	3D visual grounding, Visual Grounding	Unverified
Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics	Oct 10, 2024	Visual Grounding	Unverified
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations	Feb 2, 2024	Contrastive Learning, Object	Unverified
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving	Mar 28, 2025	3D visual grounding, Autonomous Driving	Unverified
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection	Sep 18, 2023	3D Object Detection, 3D Open-Vocabulary Object Detection	Unverified
OG: Equip vision occupancy with instance segmentation and visual grounding	Jul 12, 2023	Instance Segmentation, Segmentation	Unverified
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web	Feb 27, 2024	Language Modeling, Language Modelling	Unverified
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding	Jan 1, 2024	Scene Understanding, Visual Grounding	Unverified
On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval	Apr 24, 2019	Retrieval, Visual Grounding	Unverified
On the Role of Visual Grounding in VQA	Jun 26, 2024	Visual Grounding, Visual Question Answering (VQA)	Unverified
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models	Jul 18, 2024	3D Semantic Segmentation, Semantic Segmentation	Unverified
OptiBox: Breaking the Limits of Proposals for Visual Grounding	Nov 29, 2019	Image Captioning, Visual Grounding	Unverified
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization	Oct 8, 2018	Question Answering, Visual Grounding	Unverified
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding	Dec 1, 2024	Visual Grounding	Unverified
Parallel Vertex Diffusion for Unified Visual Grounding	Mar 13, 2023	Visual Grounding	Unverified
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding	Oct 31, 2024	parameter-efficient fine-tuning, Visual Grounding	Unverified
PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding	Jul 19, 2024	3D visual grounding, Attribute	Unverified
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning	Jun 5, 2025	Math, Visual Grounding	Unverified
Context-Aware Indoor Point Cloud Object Generation through User Instructions	Nov 26, 2023	Position, Visual Grounding	Unverified
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models	Aug 15, 2024	Pose Estimation, Visual Grounding	Unverified
Programming with Pixels: Computer-Use Meets Software Engineering	Feb 24, 2025	Visual Grounding	Unverified
Propagating Over Phrase Relations for One-Stage Visual Grounding	Aug 1, 2020	Phrase Grounding, Relational Reasoning	Unverified
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding	Feb 26, 2025	3D visual grounding, Visual Grounding	Unverified
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning	Mar 30, 2025	3D visual grounding, Feature Splatting	Unverified
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics	May 22, 2025	Image Captioning, text similarity	Unverified
Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder	Jul 13, 2020	Question Answering, Visual Grounding	Unverified
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations	Jan 24, 2025	Decoder, Object	Unverified
Referencing Where to Focus: Improving VisualGrounding with Referential Query	Dec 26, 2024	Decoder, Visual Grounding	Unverified
Joint Visual Grounding with Language Scene Graphs	Jun 9, 2019	Referring Expression, Visual Grounding	Unverified
Referring to Screen Texts with Voice Assistants	Jun 10, 2023	Navigate, Visual Grounding	Unverified
Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models	Sep 8, 2021	Concept-To-Text Generation, Specificity	Unverified
Revisiting Data Auditing in Large Vision-Language Models	Apr 25, 2025	Visual Grounding	Unverified
Revisiting Visual Grounding	Apr 3, 2019	Image Retrieval, Retrieval	Unverified
Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation	Mar 14, 2024	Decision Making, Language Modeling	Unverified
Extending CLIP's Image-Text Alignment to Referring Image Segmentation	Jun 14, 2023	Image Segmentation, Referring Expression Segmentation	Unverified
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception	Jan 31, 2025	Reinforcement Learning (RL), Spatial Reasoning	Unverified
RoViST: Learning Robust Metrics for Visual Storytelling	Dec 17, 2021	Sentence, Text Generation	Unverified
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought	Jun 4, 2025	Multimodal Reasoning, Reasoning Segmentation	Unverified
Sample-Specific Debiasing for Better Image-Text Models	Apr 25, 2023	Contrastive Learning, Cross-Modal Retrieval	Unverified
Scene-Intuitive Agent for Remote Embodied Visual Grounding	Mar 24, 2021	cross-modal alignment, Navigate	Unverified
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding	Jan 17, 2024	3D visual grounding, Scene Understanding	Unverified
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling	Feb 1, 2024	Diversity, Image Captioning	Unverified
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge	Jul 5, 2024	Cross-Modal Retrieval, Question Answering	Unverified
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding	Dec 5, 2024	3D visual grounding, Object Localization	Unverified
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes	Mar 24, 2025	Cross-Modal Retrieval, Disentanglement	Unverified
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes	Jan 1, 2025	Cross-Modal Retrieval, Disentanglement	Unverified
Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge	Feb 21, 2022	Grounded language learning, Image Retrieval	Unverified
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding	May 21, 2025	Visual Grounding	Unverified
Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation	Jun 12, 2025	Image Segmentation, Segmentation	Unverified


Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
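
The tables above report Accuracy (%) and IoU without saying how either is computed. As a rough illustration only, the sketch below computes intersection over union (IoU) for two boxes and an accuracy figure under a common convention in referring-expression benchmarks: a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. The threshold, the `(x1, y1, x2, y2)` corner format, and the function names `box_iou` and `grounding_accuracy` are assumptions made for this example, not details taken from the listed models or papers.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target reaches `threshold`.

    The 0.5 default is an assumed convention, not something the tables state.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50 / union 150
    print(grounding_accuracy(preds, targets))       # one hit out of two
```

Because a stricter or looser threshold changes the reported percentage, accuracy figures are comparable across models only if they were computed under the same matching rule.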