Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How should the image be understood?
How can the target object be located?
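The three questions above can be illustrated with a toy sketch: assume the query and each candidate region have already been encoded as feature vectors, then pick the region most similar to the query. This is only one possible formulation (a two-stage, propose-then-rank approach), not the method of any paper listed here. The `Region` class, the `ground` function, and the hand-written feature vectors are all hypothetical and stand in for real language and vision encoders.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    box: tuple     # (x1, y1, x2, y2) in pixels
    feature: list  # hypothetical visual embedding of the region


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(query_feature, regions):
    """Return the candidate region whose feature best matches the query."""
    return max(regions, key=lambda r: cosine(query_feature, r.feature))


# Toy data: two candidate regions and one query embedding (all made up).
regions = [
    Region(box=(10, 20, 110, 220), feature=[0.9, 0.1, 0.0]),
    Region(box=(200, 40, 320, 260), feature=[0.1, 0.8, 0.3]),
]
query_feature = [0.0, 1.0, 0.2]

best = ground(query_feature, regions)
print(best.box)
```

In this sketch the query encoder answers the first question, the region features answer the second, and the similarity ranking answers the third. End-to-end models typically regress the box directly instead of ranking proposals.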

Papers


Showing 351–375 of 571 papers

Title	Date	Tasks	Status	Hype
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding	May 19, 2023	Sentence, Visual Grounding	Unverified	0
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding	May 18, 2023	Contrastive Learning, Object	Unverified	0
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding	May 15, 2023	Diversity, Transfer Learning	Code Available	1
Sample-Specific Debiasing for Better Image-Text Models	Apr 25, 2023	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified	0
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language	Apr 12, 2023	3D visual grounding, Autonomous Driving	Code Available	0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance	Mar 29, 2023	3D visual grounding, Visual Grounding	Code Available	1
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding	Mar 23, 2023	3D visual grounding, Visual Grounding	Code Available	0
Joint Visual Grounding and Tracking with Natural Language Specification	Mar 21, 2023	Visual Grounding, Visual Tracking	Code Available	1
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image Analysis, Phrase Grounding	Unverified	0
Parallel Vertex Diffusion for Unified Visual Grounding	Mar 13, 2023	Visual Grounding	Unverified	0
Focusing On Targets For Improving Weakly Supervised Visual Grounding	Feb 22, 2023	Dependency Parsing, Object	Unverified	0
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	Feb 1, 2023	Action Classification, Image Classification	Code Available	4
Champion Solution for the WSDM2023 Toloka VQA Challenge	Jan 22, 2023	Question Answering, Visual Grounding	Code Available	3
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks	Jan 12, 2023	Cross-Modal Retrieval, Open-Ended Question Answering	Code Available	0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding	Jan 1, 2023	3D visual grounding, Visual Grounding	Unverified	0
CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition	Jan 1, 2023	Sign Language Recognition, Visual Grounding	Unverified	0
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding	Jan 1, 2023	Descriptive, Object	Code Available	1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training	Jan 1, 2023	3D dense captioning, 3D visual grounding	Code Available	1
Dynamic Inference With Grounding Based Vision and Language Models	Jan 1, 2023	Language Modelling, Referring Expression	Unverified	0
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image Generation, Image-text Retrieval	Unverified	0
Position-guided Text Prompt for Vision-Language Pre-training	Dec 19, 2022	Cross-Modal Retrieval, Image Captioning	Code Available	1
Using Multiple Instance Learning to Build Multimodal Representations	Dec 11, 2022	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding	Dec 1, 2022	3D dense captioning, 3D visual grounding	Unverified	0
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding	Nov 28, 2022	object-detection, Object Detection	Code Available	1



Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
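
The Accuracy (%) and IoU metrics reported above can be sketched as follows. On referring-expression benchmarks such as RefCOCO, a predicted box is conventionally counted as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. That the Accuracy (%) column here follows this convention is an assumption, since the page does not say so. The function names `iou` and `accuracy_at` and the example boxes are illustrative only.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Made-up example: one exact match and one poorly overlapping prediction.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
score = accuracy_at(preds, gts)
print(score)
```

Some evaluation scripts use a strict inequality (IoU > 0.5) instead. The two choices differ only for boxes that land exactly on the threshold.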