Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 441–450 of 571 papers

Title	Date	Tasks	Status	Hype
RoViST: Learning Robust Metrics for Visual Storytelling	Dec 17, 2021	SentenceText Generation	—Unverified	0
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision	Dec 14, 2021	Contrastive LearningRepresentation Learning	CodeCode Available	1
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding	Dec 2, 2021	3D dense captioning3D visual grounding	—Unverified	0
Less is More: Generating Grounded Navigation Instructions from Landmarks	Nov 25, 2021	DecoderInstruction Following	—Unverified	0
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling	Nov 23, 2021	Image CaptioningImage Description	CodeCode Available	1
Grounded Situation Recognition with Transformers	Nov 19, 2021	DecoderGrounded Situation Recognition	CodeCode Available	1
Zero-Shot Visual Grounding of Referring Utterances in Dialogue	Nov 16, 2021	DescriptiveVisual Grounding	—Unverified	0
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts	Nov 16, 2021	Cross-Modal RetrievalImage Captioning	CodeCode Available	1
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer	Oct 16, 2021	Text GenerationVisual Grounding	—Unverified	0
Efficient Multi-Modal Embeddings from Structured Data	Oct 6, 2021	Semantic SimilaritySemantic Textual Similarity	—Unverified	0

Show:10 25 50

← PrevPage 45 of 58Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified