Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers


Showing 326–350 of 571 papers

Title	Date	Tasks	Status
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring	Feb 16, 2025	Instance Segmentation, Language Modeling	Unverified
Leveraging Past References for Robust Language Grounding	Nov 1, 2019	Object, Referring Expression	Unverified
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers	Nov 7, 2024	3D visual grounding, Autonomous Driving	Unverified
Lightweight In-Context Tuning for Multimodal Unified Models	Oct 8, 2023	Image Captioning, In-Context Learning	Unverified
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language Modeling, Language Modelling	Unverified
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding	May 27, 2024	Visual Grounding	Unverified
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation	Jan 1, 2024	Image Segmentation, Semantic Segmentation	Unverified
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation	Aug 29, 2024	Instruction Following, Medical Report Generation	Unverified
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Oct 9, 2022	Image-text Retrieval, multimodal interaction	Unverified
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image Analysis, Phrase Grounding	Unverified
MedRG: Medical Report Grounding with Multi-modal Large Language Model	Apr 10, 2024	Decoder, Language Modeling	Unverified
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding	May 17, 2025	Visual Grounding, Visual Question Answering (VQA)	Unverified
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	May 27, 2025	Hallucination, Visual Grounding	Unverified
MMR: Evaluating Reading Ability of Large Multimodal Models	Aug 26, 2024	Font Recognition, MMR total	Unverified
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding	Nov 27, 2022	named-entity-recognition, Named Entity Recognition	Unverified
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs	Jun 2, 2025	Instruction Following, Text Generation	Unverified
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models	May 23, 2025	Diagnostic, Hallucination	Unverified
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level	Nov 15, 2024	Benchmarking, counterfactual	Unverified
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation	May 29, 2025	Question Answering, RAG	Unverified
Multi-Granularity Modularized Network for Abstract Visual Reasoning	Jul 9, 2020	Visual Grounding, Visual Reasoning	Unverified
Multimodal Reference Visual Grounding	Apr 2, 2025	Few-Shot Object Detection, Visual Grounding	Unverified
Multimodal Unified Attention Networks for Vision-and-Language Interactions	Aug 12, 2019	Question Answering, Visual Grounding	Unverified
Multi-task Learning of Hierarchical Vision-Language Representation	Dec 3, 2018	Multi-Task Learning, Question Answering	Unverified
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar	Aug 30, 2024	Autonomous Driving, Visual Grounding	Unverified



Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
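
The tables above report two metrics, Accuracy (%) and IoU, without saying how they are computed. As a rough illustration only, the sketch below shows one common way such scores are derived for box-level grounding: a prediction counts as correct when its intersection-over-union with the ground-truth box reaches a threshold. The (x1, y1, x2, y2) box format, the 0.5 threshold, and the function names `box_iou` and `grounding_accuracy` are assumptions made for this example, not details taken from this page or from any listed model's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated in the source
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

Under these assumptions, an accuracy figure depends on the chosen threshold, so scores from different papers are comparable only if they use the same evaluation protocol.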