Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 381–390 of 571 papers

Title	Date	Tasks	Status	Hype
Visually Grounded VQA by Lattice-based Retrieval	Nov 15, 2022	Information RetrievalQuestion Answering	CodeCode Available	0
Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?	Oct 24, 2022	InformativenessText Generation	—Unverified	0
Instruction-Following Agents with Multimodal Transformer	Oct 24, 2022	Instruction FollowingVisual Grounding	CodeCode Available	1
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image CaptioningImage-text Retrieval	—Unverified	0
A Visual Tour Of Current Challenges In Multimodal Language Models	Oct 22, 2022	Image GenerationText to Image Generation	—Unverified	0
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding	Oct 22, 2022	3D visual groundingSentence	CodeCode Available	1
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends	Oct 17, 2022	Few-Shot LearningImage Captioning	CodeCode Available	3
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language ModelingLanguage Modelling	—Unverified	0
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding	Oct 10, 2022	Visual Grounding	—Unverified	0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Oct 9, 2022	Image-text Retrievalmultimodal interaction	—Unverified	0

Show:10 25 50

← PrevPage 39 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified