Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How to understand an image?
How to locate an object?

Papers


Showing 51–100 of 571 papers

Title	Date	Tasks	Status	Hype
F-LMM: Grounding Frozen Large Multimodal Models	Jun 9, 2024	General Knowledge, Instruction Following	Code Available	2
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language	Jun 9, 2024	Contrastive Learning, Cross-Modal Retrieval	Code Available	2
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs	Apr 25, 2024	Visual Grounding, Visual Question Answering	Code Available	2
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding	Apr 20, 2024	cross-modal alignment, Visual Grounding	Code Available	2
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	Mar 29, 2024	Hallucination, Image Captioning	Code Available	2
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis	Mar 22, 2024	Medical Diagnosis, Medical Visual Question Answering	Code Available	2
The Revolution of Multimodal Large Language Models: A Survey	Feb 19, 2024	Image Generation, Instruction Following	Code Available	2
ChatterBox: Multi-round Multimodal Referring and Grounding	Jan 24, 2024	Language Modeling, Language Modelling	Code Available	2
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Jan 18, 2024	Instruction Following, Language Modeling	Code Available	2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation	Jan 1, 2024	Descriptive, Object	Code Available	2
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts	Dec 28, 2023	All, Anatomy	Code Available	2
Aligning and Prompting Everything All at Once for Universal Visual Perception	Dec 4, 2023	All, Object	Code Available	2
NExT-Chat: An LMM for Chat, Detection and Segmentation	Nov 8, 2023	Referring Expression, Referring Expression Segmentation	Code Available	2
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models	Oct 13, 2023	Hallucination, Image Captioning	Code Available	2
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent	Sep 21, 2023	3D visual grounding, Language Modeling	Code Available	2
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	Aug 8, 2023	3D Question Answering (3D-QA), Dense Captioning	Code Available	2
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	Jul 17, 2023	Instruction Following, Sentence	Code Available	2
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	Nov 22, 2022	All, Cross-Modal Retrieval	Code Available	2
Referring Image Matting	Jun 10, 2022	Domain Generalization, Image Matting	Code Available	2
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs	Jun 11, 2025	Hallucination, Object Hallucination	Code Available	1
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents	May 21, 2025	Answer Generation, Reinforcement Learning (RL)	Code Available	1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	May 13, 2025	3D visual grounding, Autonomous Driving	Code Available	1
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection	Apr 3, 2025	Instruction Following, Language Modeling	Code Available	1
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning	Mar 29, 2025	Chart Question Answering, Chart Understanding	Code Available	1
Visual Position Prompt for MLLM based Visual Grounding	Mar 19, 2025	Position, Visual Grounding	Code Available	1
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game	Mar 13, 2025	Multimodal Reasoning, Question Answering	Code Available	1
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding	Feb 24, 2025	cross-modal alignment, Visual Grounding	Code Available	1
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection	Feb 3, 2025	3D visual grounding, Visual Grounding	Code Available	1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning	Feb 1, 2025	Referring Expression, Visual Grounding	Code Available	1
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model	Jan 21, 2025	Hallucination, Image Captioning	Code Available	1
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis	Jan 17, 2025	Large Language Model, Multimodal Large Language Model	Code Available	1
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints	Jan 12, 2025	Image Segmentation, Referring Expression	Code Available	1
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs	Jan 11, 2025	Math, Mathematical Problem-Solving	Code Available	1
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems	Nov 21, 2024	3D visual grounding, Negation	Code Available	1
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine	Oct 16, 2024	Language Modeling, Language Modelling	Code Available	1
Visual Grounding with Multi-modal Conditional Adaptation	Sep 8, 2024	object-detection, Object Detection	Code Available	1
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Aug 23, 2024	Language Modeling, Language Modelling	Code Available	1
Visual Grounding for Object-Level Generalization in Reinforcement Learning	Aug 4, 2024	Language Modelling, Object	Code Available	1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding	Aug 2, 2024	Decoder, Reasoning Segmentation	Code Available	1
3D Vision and Language Pretraining with Large-Scale Synthetic Data	Jul 8, 2024	Dense Captioning, Diversity	Code Available	1
Multi-branch Collaborative Learning Network for 3D Visual Grounding	Jul 7, 2024	3D visual grounding, Referring Expression	Code Available	1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation	Jul 1, 2024	Image-text Retrieval, Question Answering	Code Available	1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation	Jun 11, 2024	Grounded Multimodal Named Entity Recognition, named-entity-recognition	Code Available	1
Instruction-Guided Visual Masking	May 30, 2024	Instruction Following, Visual Grounding	Code Available	1
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension	May 21, 2024	3D visual grounding, Referring Expression	Code Available	1
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding	May 10, 2024	Relation, Spatial Reasoning	Code Available	1
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling	Mar 21, 2024	Grounded language learning, Language Acquisition	Code Available	1
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory	Mar 19, 2024	Adversarial Text, Diversity	Code Available	1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning	Mar 19, 2024	Reinforcement Learning (RL), Visual Grounding	Code Available	1
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding	Mar 5, 2024	3D visual grounding, Decision Making	Code Available	1


Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
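
The tables above report Accuracy (%) and IoU without defining them. The sketch below shows one common reading, offered as an assumption rather than as the protocol behind any listed result: IoU is the intersection over union of a predicted and a ground-truth box, and a prediction counts as correct when its IoU reaches a threshold (0.5 is the usual convention on RefCOCO-style benchmarks). The names `Box`, `box_iou`, and `grounding_accuracy` are hypothetical, and boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of queries whose predicted box has IoU >= threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(grounding_accuracy(predicted, ground_truth))
```

In this toy run the first prediction matches its target exactly and the second overlaps its target only slightly, so one of the two queries counts as a hit. Segmentation-based entries may compute IoU over masks instead of boxes, which this sketch does not cover.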