Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 51–75 of 571 papers

Title	Date	Tasks	Status	Hype
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language	Jun 9, 2024	Contrastive LearningCross-Modal Retrieval	CodeCode Available	2
F-LMM: Grounding Frozen Large Multimodal Models	Jun 9, 2024	General KnowledgeInstruction Following	CodeCode Available	2
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs	Apr 25, 2024	Visual GroundingVisual Question Answering	CodeCode Available	2
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding	Apr 20, 2024	cross-modal alignmentVisual Grounding	CodeCode Available	2
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	Mar 29, 2024	HallucinationImage Captioning	CodeCode Available	2
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis	Mar 22, 2024	Medical DiagnosisMedical Visual Question Answering	CodeCode Available	2
The Revolution of Multimodal Large Language Models: A Survey	Feb 19, 2024	Image GenerationInstruction Following	CodeCode Available	2
ChatterBox: Multi-round Multimodal Referring and Grounding	Jan 24, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Jan 18, 2024	Instruction FollowingLanguage Modeling	CodeCode Available	2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation	Jan 1, 2024	DescriptiveObject	CodeCode Available	2
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts	Dec 28, 2023	AllAnatomy	CodeCode Available	2
Aligning and Prompting Everything All at Once for Universal Visual Perception	Dec 4, 2023	AllObject	CodeCode Available	2
NExT-Chat: An LMM for Chat, Detection and Segmentation	Nov 8, 2023	Referring ExpressionReferring Expression Segmentation	CodeCode Available	2
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models	Oct 13, 2023	HallucinationImage Captioning	CodeCode Available	2
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent	Sep 21, 2023	3D visual groundingLanguage Modeling	CodeCode Available	2
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	Aug 8, 2023	3D Question Answering (3D-QA)Dense Captioning	CodeCode Available	2
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	Jul 17, 2023	Instruction FollowingSentence	CodeCode Available	2
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	Nov 22, 2022	AllCross-Modal Retrieval	CodeCode Available	2
Referring Image Matting	Jun 10, 2022	Domain GeneralizationImage Matting	CodeCode Available	2
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs	Jun 11, 2025	HallucinationObject Hallucination	CodeCode Available	1
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents	May 21, 2025	Answer GenerationReinforcement Learning (RL)	CodeCode Available	1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	May 13, 2025	3D visual groundingAutonomous Driving	CodeCode Available	1
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection	Apr 3, 2025	Instruction FollowingLanguage Modeling	CodeCode Available	1
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning	Mar 29, 2025	Chart Question AnsweringChart Understanding	CodeCode Available	1
Visual Position Prompt for MLLM based Visual Grounding	Mar 19, 2025	PositionVisual Grounding	CodeCode Available	1

Show:10 25 50

← PrevPage 3 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified