SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):

  • How to identify the main focus of a query?
  • How to understand the image?
  • How to locate the referred object?
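
As a concrete illustration of the query-to-box step, here is a minimal inference sketch using the Hugging Face transformers release of Kosmos-2 (one of the models in the paper list below). The model id and the `<grounding>`/`<phrase>` prompt format follow the public model card; the image URL and the query phrase are placeholders, not values from this page.

```python
# Minimal phrase-grounding sketch with Kosmos-2 via Hugging Face transformers.
# Assumptions: model id and prompt format as on the public model card;
# the image URL and the query phrase are placeholders.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)
# The <grounding> token asks the model to emit location tokens for the <phrase> query.
prompt = "<grounding><phrase> a snowman</phrase>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation returns the cleaned caption plus a list of
# (phrase, char_span, boxes); boxes are normalized to [0, 1], so scale
# by the image size to get pixel coordinates.
caption, entities = processor.post_process_generation(generated_text)
w, h = image.size
for phrase, _, boxes in entities:
    for x0, y0, x1, y1 in boxes:
        print(phrase, (x0 * w, y0 * h, x1 * w, y1 * h))
```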

Papers

Showing 51–100 of 571 papers

| Title | Status | Hype |
| --- | --- | --- |
| VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Code | 2 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2 |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Code | 2 |
| Reasoning to Attend: Try to Understand How <SEG> Token Works | Code | 2 |
| VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Code | 2 |
| Aligning and Prompting Everything All at Once for Universal Visual Perception | Code | 2 |
| Referring Image Matting | Code | 2 |
| LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent | Code | 2 |
| SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding | Code | 2 |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | Code | 2 |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Code | 2 |
| Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Code | 2 |
| One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Code | 2 |
| ChatterBox: Multi-round Multimodal Referring and Grounding | Code | 2 |
| MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Code | 2 |
| GTA1: GUI Test-time Scaling Agent | Code | 2 |
| HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Code | 2 |
| Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Code | 2 |
| GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models | Code | 1 |
| GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution | Code | 1 |
| Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Code | 1 |
| Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1 |
| Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions | Code | 1 |
| Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling | Code | 1 |
| Local-Global Context Aware Transformer for Language-Guided Video Segmentation | Code | 1 |
| Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter | Code | 1 |
| Learning Cross-modal Context Graph for Visual Grounding | Code | 1 |
| Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | Code | 1 |
| PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| Fine-Grained Semantically Aligned Vision-Language Pre-Training | Code | 1 |
| Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision | Code | 1 |
| An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding | Code | 1 |
| Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation | Code | 1 |
| Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1 |
| Joint Visual Grounding and Tracking with Natural Language Specification | Code | 1 |
| Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding | Code | 1 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1 |
| UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code | 1 |
| A Unified Framework for 3D Point Cloud Visual Grounding | Code | 1 |
| Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans | Code | 1 |
| CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | Code | 1 |
| A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1 |
| Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection | Code | 1 |
| EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | Code | 1 |
| CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Code | 1 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified |
| 7 | HYDRA | IoU | 61.7 | | Unverified |
| 8 | HYDRA | IoU | 61.1 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified |
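
A note on the metrics above: for box-based visual grounding, Accuracy (%) is conventionally Acc@0.5, the fraction of predictions whose IoU with the ground-truth box is at least 0.5, while IoU rows report intersection-over-union directly. Below is a minimal sketch of both computations; the boxes are hypothetical examples in (x0, y0, x1, y1) format, not values from these tables.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in (x0, y0, x1, y1) format."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Acc@thresh: percentage of predictions with IoU >= thresh vs. ground truth."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# Hypothetical example: two queries, one correct localization.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 8, 48, 52), (60, 60, 90, 90)]
print(grounding_accuracy(preds, gts))  # 50.0
```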