Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 251–300 of 571 papers

Title	Date	Tasks	Status
Towards Visual Text Grounding of Multimodal Large Language Model	Apr 7, 2025	BenchmarkingLanguage Modeling	—Unverified
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities	Apr 2, 2025	DescriptiveLarge Language Model	CodeCode Available
Image Difference Grounding with Natural Language	Apr 2, 2025	Visual Grounding	—Unverified
Multimodal Reference Visual Grounding	Apr 2, 2025	Few-Shot Object DetectionVisual Grounding	—Unverified
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing	Mar 31, 2025	Objectobject-detection	CodeCode Available
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning	Mar 30, 2025	3D visual groundingFeature Splatting	—Unverified
Efficient Adaptation For Remote Sensing Visual Grounding	Mar 29, 2025	parameter-efficient fine-tuningVisual Grounding	—Unverified
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving	Mar 28, 2025	3D visual groundingAutonomous Driving	—Unverified
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding	Mar 25, 2025	AttributeObject	—Unverified
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes	Mar 24, 2025	Cross-Modal RetrievalDisentanglement	—Unverified
A Vision Centric Remote Sensing Benchmark	Mar 20, 2025	Question AnsweringRepresentation Learning	—Unverified
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation	Mar 18, 2025	DecoderObject	CodeCode Available
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding	Mar 8, 2025	Language ModelingLanguage Modelling	—Unverified
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions	Mar 5, 2025	Anomaly DetectionVisual Grounding	—Unverified
Teaching Metric Distance to Autoregressive Multimodal Foundational Models	Mar 4, 2025	Image GenerationVisual Grounding	—Unverified
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning	Feb 28, 2025	Task PlanningVisual Grounding	—Unverified
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding	Feb 26, 2025	3D visual groundingVisual Grounding	—Unverified
Programming with Pixels: Computer-Use Meets Software Engineering	Feb 24, 2025	Visual Grounding	—Unverified
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image CaptioningObject Detection	—Unverified
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring	Feb 16, 2025	Instance SegmentationLanguage Modeling	—Unverified
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	Feb 11, 2025	RetrievalVision and Language Navigation	—Unverified
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception	Jan 31, 2025	Reinforcement Learning (RL)Spatial Reasoning	—Unverified
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations	Jan 24, 2025	DecoderObject	—Unverified
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis	Jan 17, 2025	Bayesian InferenceLanguage Modeling	—Unverified
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring	Jan 16, 2025	3D visual groundingDecoder	—Unverified
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image CaptioningLanguage Modeling	—Unverified
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models	Jan 6, 2025	HallucinationVisual Grounding	—Unverified
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding	Jan 2, 2025	3D visual groundingDiagnostic	—Unverified
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes	Jan 1, 2025	Cross-Modal RetrievalDisentanglement	—Unverified
Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding	Jan 1, 2025	3D visual groundingData Augmentation	CodeCode Available
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Jan 1, 2025	Large Language ModelVideo Segmentation	—Unverified
Beyond Human Perception: Understanding Multi-Object World from Monocular View	Jan 1, 2025	3D visual groundingDenoising	CodeCode Available
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding	Jan 1, 2025	Referring ExpressionReferring Expression Comprehension	—Unverified
Referencing Where to Focus: Improving VisualGrounding with Referential Query	Dec 26, 2024	DecoderVisual Grounding	—Unverified
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models	Dec 22, 2024	Language ModelingLanguage Modelling	CodeCode Available
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues	Dec 19, 2024	Change DetectionDisaster Response	—Unverified
FiVL: A Framework for Improved Vision-Language Alignment	Dec 19, 2024	Answer GenerationMultimodal Reasoning	CodeCode Available
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene UnderstandingSemantic Segmentation	—Unverified
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses	Dec 11, 2024	Image-text RetrievalQuestion Answering	—Unverified
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models	Dec 11, 2024	Question AnsweringVisual Grounding	CodeCode Available
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning3D visual grounding	—Unverified
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	Dec 6, 2024	document understandingHallucination	—Unverified
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction	Dec 5, 2024	Relation ExtractionVisual Grounding	CodeCode Available
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding	Dec 5, 2024	3D visual groundingObject Localization	—Unverified
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding	Dec 1, 2024	Visual Grounding	—Unverified
3D Scene Graph Guided Vision-Language Pre-training	Nov 27, 2024	3D dense captioning3D visual grounding	—Unverified
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset	Nov 21, 2024	Question AnsweringVisual Grounding	CodeCode Available
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarkingcounterfactual	—Unverified
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers	Nov 7, 2024	3D visual groundingAutonomous Driving	—Unverified
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Nov 7, 2024	DecoderLanguage Modeling	—Unverified

Show:10 25 50

← PrevPage 6 of 12Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified