SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal interface sketch follows the list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How should the referred object be localized?
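
To make the task concrete, the sketch below shows a minimal visual-grounding interface: an image and a query go in, a bounding box comes out. The GroundingModel class and its predict method are hypothetical placeholders, not the API of any paper listed on this page.

    # Minimal visual-grounding interface sketch (hypothetical names, dummy output).
    from dataclasses import dataclass
    from typing import Any, Tuple

    @dataclass
    class GroundingResult:
        box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
        score: float                            # confidence of the prediction

    class GroundingModel:
        """Stand-in for a VG model: encode image + query, fuse, predict a region."""

        def predict(self, image: Any, query: str) -> GroundingResult:
            # A real model would run visual and text encoders here and either
            # regress box coordinates or rank region proposals against the query.
            # This dummy returns a fixed box so the sketch runs end to end.
            return GroundingResult(box=(0.0, 0.0, 1.0, 1.0), score=0.0)

    model = GroundingModel()
    result = model.predict(image=None, query="the dog to the left of the bench")
    print(result.box, result.score)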

Papers

Showing 376–400 of 571 papers

Title | Status | Hype
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding |  | 0
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding | Code | 1
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2
A survey on knowledge-enhanced multimodal learning |  | 0
YORO -- Lightweight End to End Visual Grounding | Code | 1
Visually Grounded VQA by Lattice-based Retrieval | Code | 0
Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue? |  | 0
Instruction-Following Agents with Multimodal Transformer | Code | 1
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data |  | 0
A Visual Tour Of Current Challenges In Multimodal Language Models |  | 0
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding | Code | 1
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
Like a bilingual baby: The advantage of visually grounding a bilingual language model |  | 0
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding |  | 0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning |  | 0
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach | Code | 0
Cost-Effective Language Driven Image Editing with LX-DRIM | Code | 0
GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution | Code | 1
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement |  | 0
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | Code | 1
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding |  | 0
Introspective Learning : A Two-Stage Approach for Inference in Neural Networks | Code | 0
Visual Grounding of Inter-lingual Word-Embeddings |  | 0
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
VLMAE: Vision-Language Masked Autoencoder |  | 0
Page 16 of 23

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 |  | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 |  | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 |  | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 |  | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 |  | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 |  | Unverified
7 | HYDRA | IoU | 61.7 |  | Unverified
8 | HYDRA | IoU | 61.1 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 |  | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 |  | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 |  | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 |  | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 |  | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 |  | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 |  | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 |  | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 |  | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 |  | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 |  | Unverified
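
The page does not state the evaluation protocol for these tables. On common referring-expression benchmarks, Accuracy (%) usually means the fraction of queries whose predicted box reaches IoU of at least 0.5 with the ground-truth box, while the IoU rows report overlap directly; the 0.5 threshold below is therefore an assumption, not something stated on this page. A minimal sketch of IoU and that accuracy computation:

    # Intersection-over-Union between two axis-aligned boxes (x1, y1, x2, y2),
    # plus an accuracy helper that counts predictions with IoU >= threshold
    # (threshold of 0.5 is an assumption, not taken from this page).
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]

    def iou(a: Box, b: Box) -> float:
        # Overlap rectangle, clipped to zero width/height when boxes are disjoint.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def grounding_accuracy(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
        """Percentage of predictions whose IoU with the ground truth reaches thr."""
        hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
        return 100.0 * hits / len(gts)

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143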