Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How should the target object be located?
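
As a rough illustration of the task's input and output, the sketch below treats grounding as picking the best-scoring region for a query. It assumes some model has already scored each candidate box against the query. The `Candidate` type, the `ground` function, and the scores are hypothetical; they only show the shape of the problem, not any listed paper's method.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    box: Box
    score: float  # hypothetical query-to-region relevance score


def ground(candidates: List[Candidate]) -> Box:
    """Return the box of the candidate most relevant to the query."""
    if not candidates:
        raise ValueError("no candidate regions to choose from")
    return max(candidates, key=lambda c: c.score).box


# Toy values for a query such as "the dog on the left"; scores are made up.
candidates = [
    Candidate(box=(10, 40, 120, 200), score=0.91),
    Candidate(box=(300, 50, 420, 210), score=0.34),
]
best_box = ground(candidates)
```

Many recent models instead predict box coordinates directly from the image and the query, so the candidate-scoring step here is only one possible framing.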

Papers


Showing 1–50 of 571 papers

Title	Date	Tasks	Status	Hype	Score
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model	Apr 10, 2025	Language Modeling, Language Modelling	Code Available	9	5
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Dec 13, 2024	Chart Understanding, Mixture-of-Experts	Code Available	9	5
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	Oct 14, 2023	Image Classification, Image Description	Code Available	7	5
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	Jan 7, 2025	Autonomous Driving, General Knowledge	Code Available	5	5
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs	Jun 24, 2024	Representation Learning, Visual Grounding	Code Available	5	5
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	Aug 24, 2023	Chart Question Answering, FS-MEVQA	Code Available	5	5
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	Apr 19, 2024	Language Modeling, Language Modelling	Code Available	4	5
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs	Jan 1, 2024	Visual Grounding, World Knowledge	Code Available	4	5
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V	Oct 17, 2023	Interactive Segmentation, Referring Expression	Code Available	4	5
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	Feb 1, 2023	Action Classification, Image Classification	Code Available	4	5
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations	Jun 13, 2024	3D visual grounding, Attribute	Code Available	4	5
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics	May 23, 2025	Chart Understanding, object-detection	Code Available	3	5
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	Oct 7, 2024	Natural Language Visual Grounding, Navigate	Code Available	3	5
AgentStudio: A Toolkit for Building General Virtual Agents	Mar 26, 2024	Visual Grounding	Code Available	3	5
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models	Jun 17, 2024	Document Classification, Visual Grounding	Code Available	3	5
Champion Solution for the WSDM2023 Toloka VQA Challenge	Jan 22, 2023	Question Answering, Visual Grounding	Code Available	3	5
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions	Jun 9, 2024	3D visual grounding, Survey	Code Available	3	5
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends	Oct 17, 2022	Few-Shot Learning, Image Captioning	Code Available	3	5
Aria-UI: Visual Grounding for GUI Instructions	Dec 20, 2024	Natural Language Visual Grounding, Visual Grounding	Code Available	3	5
Towards Visual Grounding: A Survey	Dec 28, 2024	Phrase Grounding, Referring Expression	Code Available	3	5
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction	Feb 27, 2024	3D geometry, 3D Object Captioning	Code Available	3	5
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence	Nov 22, 2024	3D visual grounding, Visual Grounding	Code Available	3	5
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs	Jan 11, 2024	Representation Learning, Self-Supervised Learning	Code Available	3	5
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding	Feb 14, 2025	3D Object Detection, 3D visual grounding	Code Available	3	5
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning	May 18, 2025	Reinforcement Learning (RL), Visual Grounding	Code Available	3	5
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation	Jul 25, 2024	3D visual grounding, Image Segmentation	Code Available	2	5
Aligning and Prompting Everything All at Once for Universal Visual Perception	Dec 4, 2023	All, Object	Code Available	2	5
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories	Mar 11, 2025	Decision Making, Interactive Segmentation	Code Available	2	5
Reasoning to Attend: Try to Understand How <SEG> Token Works	Dec 23, 2024	Semantic Similarity, Semantic Textual Similarity	Code Available	2	5
Referring Image Matting	Jun 10, 2022	Domain Generalization, Image Matting	Code Available	2	5
NExT-Chat: An LMM for Chat, Detection and Segmentation	Nov 8, 2023	Referring Expression, Referring Expression Segmentation	Code Available	2	5
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention	Jun 18, 2024	Object, Response Generation	Code Available	2	5
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World	Jun 30, 2025	Caption Generation, Object	Code Available	2	5
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts	Dec 28, 2023	All, Anatomy	Code Available	2	5
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent	Sep 21, 2023	3D visual grounding, Language Modeling	Code Available	2	5
ChatterBox: Multi-round Multimodal Referring and Grounding	Jan 24, 2024	Language Modeling, Language Modelling	Code Available	2	5
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis	Mar 22, 2024	Medical Diagnosis, Medical Visual Question Answering	Code Available	2	5
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition	May 21, 2025	Earth Observation, Object	Code Available	2	5
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation	Aug 9, 2024	Image to text, Object	Code Available	2	5
Interpreting Object-level Foundation Models via Visual Precision Search	Nov 25, 2024	Explainable Artificial Intelligence (XAI), Object	Code Available	2	5
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	Jul 17, 2023	Instruction Following, Sentence	Code Available	2	5
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	Aug 8, 2023	3D Question Answering (3D-QA), Dense Captioning	Code Available	2	5
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding	Mar 17, 2025	Domain Generalization, Multimodal Reasoning	Code Available	2	5
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding	Sep 5, 2024	Question Answering, Scene Understanding	Code Available	2	5
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs	Apr 25, 2024	Visual Grounding, Visual Question Answering	Code Available	2	5
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	Mar 29, 2024	Hallucination, Image Captioning	Code Available	2	5
A Simple Aerial Detection Baseline of Multimodal Language Models	Jan 16, 2025	object-detection, Object Detection	Code Available	2	5
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning	Jul 8, 2025	MME, Reinforcement Learning (RL)	Code Available	2	5
GTA1: GUI Test-time Scaling Agent	Jul 8, 2025	Reinforcement Learning (RL), Task Planning	Code Available	2	5


Datasets (the tables below appear in this order): RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
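
The tables above report Accuracy (%) and IoU. A minimal sketch of how these metrics are often computed for box-level grounding follows. It assumes boxes in (x1, y1, x2, y2) format and the common convention that a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. The threshold and the function names are assumptions, and the listed models may have been evaluated under different protocols.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


# Toy example: the first prediction matches exactly, the second misses entirely.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
score = accuracy(preds, gts)
```

Because accuracy depends on the chosen threshold while IoU is a raw overlap ratio, rows reporting different metrics in the same table are not directly comparable.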