SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the target object be located?
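
To make the task concrete, here is a minimal inference sketch using an off-the-shelf open-vocabulary detector (OWL-ViT through the Hugging Face transformers library). The model choice and the pick-the-top-box rule are illustrative assumptions, not the method of any particular paper listed below; any detector that scores regions against text would slot in the same way.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Illustrative model choice (assumption): any text-conditioned detector works.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def ground(image: Image.Image, query: str):
    """Return the pixel box in `image` that best matches the `query` phrase."""
    inputs = processor(text=[[query]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Map raw logits and box offsets to scored boxes in pixel coordinates.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=0.0, target_sizes=target_sizes
    )[0]
    # VG keeps only the single highest-scoring region for the query.
    best = results["scores"].argmax()
    return results["boxes"][best].tolist(), results["scores"][best].item()

# Usage: box, score = ground(Image.open("street.jpg"), "the red car")
```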

Papers

Showing 251–300 of 571 papers (page 6 of 12)

Title | Status | Hype
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe | | 0
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement | | 0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | | 0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | | 0
Data-Efficient 3D Visual Grounding via Order-Aware Referring | | 0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding | | 0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding | | 0
Dynamic Inference With Grounding Based Vision and Language Models | | 0
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding | | 0
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models | | 0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | | 0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | | 0
Efficient Adaptation For Remote Sensing Visual Grounding | | 0
Efficient Multi-Modal Embeddings from Structured Data | | 0
Emergent Communication with World Models | | 0
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities | | 0
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions | | 0
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment | | 0
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | 0
Learning to Assemble Neural Module Tree Networks for Visual Grounding | | 0
Explainable Video Entailment With Grounded Visual Evidence | | 0
FACET: Fairness in Computer Vision Evaluation Benchmark | | 0
Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot | | 0
Few-Shot Visual Grounding for Natural Human-Robot Interaction | | 0
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos | | 0
FindIt: Generalized Localization with Natural Language Queries | | 0
Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding | | 0
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | | 0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | | 0
Focusing On Targets For Improving Weakly Supervised Visual Grounding | | 0
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models | | 0
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | | 0
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding | | 0
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks | | 0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting | | 0
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | | 0
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | | 0
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | | 0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | | 0
GroundCap: A Visually Grounded Image Captioning Dataset | | 0
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding | | 0
GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance | | 0
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents | | 0
Guiding Visual Question Answering with Attention Priors | | 0
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | | 0
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task | | 0
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search | | 0
Image Difference Grounding with Natural Language | | 0
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
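
A note on the two metrics in these tables: IoU is the area of overlap between the predicted and ground-truth boxes divided by the area of their union, and Accuracy (%) on VG leaderboards is conventionally Acc@0.5, the percentage of predictions whose IoU reaches 0.5 (an assumption here; the source does not state its threshold). A short sketch of both computations, assuming axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Percentage of predicted boxes whose IoU with ground truth >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# e.g. grounding_accuracy([(10, 10, 50, 50)], [(12, 8, 48, 52)]) -> 100.0
```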