
Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (see the sketch after the list for what the task looks like in code):

  • How to identify the main focus of a query?
  • How to understand the image?
  • How to localize the referred object?
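
To make the input/output concrete, here is a minimal inference sketch using the Grounding DINO integration in Hugging Face transformers. This is one possible setup, not the method of any paper below; the model id, query, thresholds, and image URL are illustrative choices, and argument names may differ slightly across transformers versions.

```python
# Minimal visual-grounding sketch: image + text query -> scored bounding boxes.
# Assumes transformers >= 4.40 (Grounding DINO support); all concrete values
# (model id, image URL, query, thresholds) are illustrative.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"  # small checkpoint, demo only
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
query = "a cat lying on the left."  # Grounding DINO expects lowercase text ending in "."

inputs = processor(images=image, text=query, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and normalized boxes into thresholded, image-space detections.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(f"{label}: {score:.2f} at {box.tolist()}")
```

This sketch prints every phrase-box pair above the thresholds; referring-expression benchmarks such as RefCOCO typically score only the single highest-confidence box against the ground truth.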

Papers

Showing 1–50 of 571 papers

Title | Status | Hype
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Code | 9
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Code | 9
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | Code | 5
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Code | 5
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | Code | 4
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Code | 4
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Code | 4
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics | Code | 3
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning | Code | 3
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding | Code | 3
Towards Visual Grounding: A Survey | Code | 3
Aria-UI: Visual Grounding for GUI Instructions | Code | 3
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence | Code | 3
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Code | 3
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | Code | 3
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions | Code | 3
AgentStudio: A Toolkit for Building General Virtual Agents | Code | 3
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Code | 3
Champion Solution for the WSDM2023 Toloka VQA Challenge | Code | 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
GTA1: GUI Test-time Scaling Agent | Code | 2
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Code | 2
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Code | 2
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Code | 2
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model | Code | 2
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding | Code | 2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2
A Simple Aerial Detection Baseline of Multimodal Language Models | Code | 2
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | Code | 2
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Code | 2
Reasoning to Attend: Try to Understand How <SEG> Token Works | Code | 2
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Code | 2
Interpreting Object-level Foundation Models via Visual Precision Search | Code | 2
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Code | 2
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | Code | 2
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Code | 2
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Code | 2
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code | 2
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation | Code | 2
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding | Code | 2
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Code | 2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Code | 2
Towards Vision-Language Geo-Foundation Model: A Survey | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | - | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | - | Unverified
7 | HYDRA | IoU | 61.7 | - | Unverified
8 | HYDRA | IoU | 61.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | - | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | - | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | - | Unverified
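
The tables above report two metrics. Visual grounding accuracy is conventionally Acc@0.5: a prediction counts as correct when its box overlaps the ground-truth box with intersection-over-union (IoU) of at least 0.5. The tables do not state their thresholds, so treat 0.5 as the usual default rather than a certainty. A minimal sketch of both computations, with entirely hypothetical box values:

```python
# Minimal sketch of the two metrics above: IoU between axis-aligned boxes,
# and Acc@0.5-style grounding accuracy. Boxes are (x_min, y_min, x_max, y_max);
# all numeric values below are hypothetical.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]    # hypothetical predicted boxes
gts   = [(12, 8, 52, 48), (30, 30, 60, 60)]   # hypothetical ground-truth boxes
print(grounding_accuracy(preds, gts))  # 50.0: the first pair overlaps enough, the second not at all
```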