Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 101–150 of 571 papers

Title	Date	Tasks	Status	Hype
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	Jan 7, 2025	Autonomous Driving, General Knowledge	Code Available	5
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models	Jan 6, 2025	Hallucination, Visual Grounding	Unverified	0
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding	Jan 2, 2025	3D visual grounding, Diagnostic	Unverified	0
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes	Jan 1, 2025	Cross-Modal Retrieval, Disentanglement	Unverified	0
Beyond Human Perception: Understanding Multi-Object World from Monocular View	Jan 1, 2025	3D visual grounding, Denoising	Code Available	0
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Jan 1, 2025	Large Language Model, Video Segmentation	Unverified	0
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding	Jan 1, 2025	3D visual grounding, Data Augmentation	Code Available	0
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention	Jan 1, 2025	Hallucination, Response Generation	Code Available	2
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding	Jan 1, 2025	Referring Expression, Referring Expression Comprehension	Unverified	0
Towards Visual Grounding: A Survey	Dec 28, 2024	Phrase Grounding, Referring Expression	Code Available	3
Referencing Where to Focus: Improving Visual Grounding with Referential Query	Dec 26, 2024	Decoder, Visual Grounding	Unverified	0
Reasoning to Attend: Try to Understand How <SEG> Token Works	Dec 23, 2024	Semantic Similarity, Semantic Textual Similarity	Code Available	2
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models	Dec 22, 2024	Language Modeling, Language Modelling	Code Available	0
Aria-UI: Visual Grounding for GUI Instructions	Dec 20, 2024	Natural Language Visual Grounding, Visual Grounding	Code Available	3
FiVL: A Framework for Improved Vision-Language Alignment	Dec 19, 2024	Answer Generation, Multimodal Reasoning	Code Available	0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues	Dec 19, 2024	Change Detection, Disaster Response	Unverified	0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene Understanding, Semantic Segmentation	Unverified	0
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Dec 13, 2024	Chart Understanding, Mixture-of-Experts	Code Available	9
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses	Dec 11, 2024	Image-text Retrieval, Question Answering	Unverified	0
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models	Dec 11, 2024	Question Answering, Visual Grounding	Code Available	0
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning, 3D visual grounding	Unverified	0
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action	Dec 7, 2024	Depth Estimation, Mathematical Reasoning	Code Available	2
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	Dec 6, 2024	document understanding, Hallucination	Code Available	0
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction	Dec 5, 2024	Relation Extraction, Visual Grounding	Code Available	0
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding	Dec 5, 2024	3D visual grounding, Object Localization	Unverified	0
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding	Dec 1, 2024	Visual Grounding	Unverified	0
3D Scene Graph Guided Vision-Language Pre-training	Nov 27, 2024	3D dense captioning, 3D visual grounding	Unverified	0
Interpreting Object-level Foundation Models via Visual Precision Search	Nov 25, 2024	Explainable Artificial Intelligence (XAI), Object	Code Available	2
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence	Nov 22, 2024	3D visual grounding, Visual Grounding	Code Available	3
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems	Nov 21, 2024	3D visual grounding, Negation	Code Available	1
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset	Nov 21, 2024	Question Answering, Visual Grounding	Code Available	0
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding	Nov 16, 2024	Instruction Following, Language Modeling	Code Available	2
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarking, counterfactual	Unverified	0
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Nov 7, 2024	Decoder, Language Modeling	Unverified	0
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers	Nov 7, 2024	3D visual grounding, Autonomous Driving	Unverified	0
Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding	Nov 5, 2024	3D visual grounding, Visual Grounding	Unverified	0
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding	Oct 31, 2024	Object, Position	Code Available	0
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding	Oct 31, 2024	parameter-efficient fine-tuning, Visual Grounding	Unverified	0
Few-Shot Multimodal Explanation for Visual Question Answering	Oct 28, 2024	Explainable artificial intelligence, Explainable Artificial Intelligence (XAI)	Code Available	0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	Oct 21, 2024	Instruction Following, object-detection	Code Available	0
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Oct 21, 2024	3D visual grounding, Object	Unverified	0
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	Oct 17, 2024	3D geometry, 3D visual grounding	Code Available	2
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine	Oct 16, 2024	Language Modeling, Language Modelling	Code Available	1
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs	Oct 16, 2024	Visual Grounding	Code Available	0
Context-Infused Visual Grounding for Art	Oct 16, 2024	object-detection, Object Detection	Code Available	0
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI	Oct 15, 2024	Question Answering, Video Question Answering	Code Available	2
Learning to Ground VLMs without Forgetting	Oct 14, 2024	Decoder, Language Modelling	Unverified	0
Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics	Oct 10, 2024	Visual Grounding	Unverified	0
GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance	Oct 9, 2024	Visual Grounding	Unverified	0
Context-Aware Command Understanding for Tabletop Scenarios	Oct 8, 2024	Decision Making, Visual Grounding	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
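
The tables above report Accuracy (%) and IoU but do not say how either is computed. The sketch below shows one common convention, assumed here rather than taken from this page: a prediction counts as correct when the intersection over union (IoU) between the predicted box and the ground-truth box is at least 0.5. The box format (x_min, y_min, x_max, y_max), the 0.5 threshold, and the function names box_iou and grounding_accuracy are illustrative assumptions, not part of any listed model's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold; benchmarks may differ
) -> float:
    """Percentage of predictions whose IoU with the target reaches the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    # First pair overlaps fully (IoU 1.0); second has IoU 1/7, below 0.5.
    print(grounding_accuracy(predictions, ground_truth))
```

Under this convention, an IoU row and an Accuracy (%) row measure different things, so the HYDRA IoU entries in the first table are not directly comparable with the accuracy entries above them.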