Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 111–120 of 571 papers

Title	Date	Tasks	Status	Hype
Referencing Where to Focus: Improving VisualGrounding with Referential Query	Dec 26, 2024	DecoderVisual Grounding	—Unverified	0
Reasoning to Attend: Try to Understand How <SEG> Token Works	Dec 23, 2024	Semantic SimilaritySemantic Textual Similarity	CodeCode Available	2
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models	Dec 22, 2024	Language ModelingLanguage Modelling	CodeCode Available	0
Aria-UI: Visual Grounding for GUI Instructions	Dec 20, 2024	Natural Language Visual GroundingVisual Grounding	CodeCode Available	3
FiVL: A Framework for Improved Vision-Language Alignment	Dec 19, 2024	Answer GenerationMultimodal Reasoning	CodeCode Available	0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues	Dec 19, 2024	Change DetectionDisaster Response	—Unverified	0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting	Dec 18, 2024	Scene UnderstandingSemantic Segmentation	—Unverified	0
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Dec 13, 2024	Chart UnderstandingMixture-of-Experts	CodeCode Available	9
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses	Dec 11, 2024	Image-text RetrievalQuestion Answering	—Unverified	0
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models	Dec 11, 2024	Question AnsweringVisual Grounding	CodeCode Available	0

Show:10 25 50

← PrevPage 12 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified