Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?
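
The three challenges above can be illustrated with a toy sketch. It is not a method from any listed paper: the `Region` class, its `tags` field, the stopword list, and the word-overlap score are all hypothetical stand-ins chosen to keep the example small. Real systems replace each step with learned language and vision models.

```python
from dataclasses import dataclass

@dataclass
class Region:
    box: tuple       # (x1, y1, x2, y2) in pixels
    tags: frozenset  # hypothetical labels produced by an image-understanding step

# Hypothetical stopword list; a real system would parse or encode the query.
STOPWORDS = {"the", "a", "an", "on", "in", "of", "to", "is"}

def query_focus(query):
    """Challenge 1: keep content words as a crude proxy for the query focus."""
    return {w for w in query.lower().split() if w not in STOPWORDS}

def ground(query, regions):
    """Challenge 3: pick the region whose tags overlap the query focus most."""
    focus = query_focus(query)
    return max(regions, key=lambda r: len(focus & r.tags))

# Challenge 2 (image understanding) is assumed done: regions arrive pre-tagged.
regions = [
    Region((10, 20, 110, 220), frozenset({"dog", "brown"})),
    Region((150, 30, 300, 240), frozenset({"dog", "white"})),
    Region((0, 0, 320, 60), frozenset({"sky"})),
]
best = ground("the white dog", regions)
print(best.box)
```

Here the query "the white dog" matches two tags of the second region and only one of the first, so the second region's box is selected.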

Papers

Showing 551–571 of 571 papers

Title	Date	Tasks	Status
Visual Coreference Resolution in Visual Dialog using Neural Module Networks	Sep 6, 2018	Common Sense Reasoning, coreference-resolution	Code Available
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining	Aug 1, 2018	Question Answering, Visual Grounding	Unverified
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search	Jul 1, 2018	General Classification, Image Retrieval	Unverified
Visually grounded cross-lingual keyword spotting in speech	Jun 13, 2018	Keyword Spotting, Visual Grounding	Unverified
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction	Jun 11, 2018	Question Generation, Question-Generation	Unverified
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos	Jun 1, 2018	Multiple Instance Learning, Sentence	Unverified
Visual Grounding via Accumulated Attention	Jun 1, 2018	Sentence, Visual Grounding	Unverified
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding	May 9, 2018	Diversity, Phrase Grounding	Code Available
Finding beans in burgers: Deep semantic-visual embedding with localization	Apr 5, 2018	Cross-Modal Retrieval, Image Captioning	Code Available
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision	Mar 17, 2018	Visual Grounding	Unverified
Interactive Reinforcement Learning for Object Grounding via Self-Talking	Dec 2, 2017	Object, reinforcement-learning	Unverified
Improving Visually Grounded Sentence Representations with Self-Attention	Dec 2, 2017	Sentence, Visual Grounding	Unverified
Self-view Grounding Given a Narrated 360° Video	Nov 23, 2017	Sentence, Visual Grounding	Code Available
Visual Reference Resolution using Attention Memory for Visual Dialog	Sep 23, 2017	Parameter Prediction, Question Answering	Unverified
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures	May 3, 2017	Sentence, Visual Grounding	Unverified
Learning Two-Branch Neural Networks for Image-Text Matching Tasks	Apr 11, 2017	Image-text matching, Retrieval	Code Available
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation	Jan 28, 2017	Response Generation, Retrieval	Unverified
Revisiting Visual Question Answering Baselines	Jun 27, 2016	Binary Classification, Multiple-choice	Code Available
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding	Jun 6, 2016	Phrase Grounding, Visual Grounding	Code Available
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes	Nov 22, 2015	Common Sense Reasoning, Image Retrieval	Code Available
Grounding of Textual Phrases in Images by Reconstruction	Nov 12, 2015	Language Modeling, Language Modelling	Code Available

Datasets: RefCOCO testA, RefCOCO+ test B, RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
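
The tables report "Accuracy (%)" and "IoU" without defining them. As a hedged sketch: on RefCOCO-style benchmarks, a predicted box is commonly counted as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. That threshold, the `(x1, y1, x2, y2)` box format, and the function names below are assumptions for illustration, not something this page states.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
print(grounding_accuracy(preds, gts))  # first box matches exactly, second overlaps too little
```

In this example the second pair overlaps in a 5 by 5 square, giving an IoU of 25/175, which falls below the assumed 0.5 threshold, so one of two predictions counts as correct.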