Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?
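
One way to read these three challenges is as a pipeline: encode the query, encode candidate image regions, then pick the region that best matches the query. The sketch below is only an illustration of that idea, not any listed paper's method. The toy vectors stand in for real text and image encoders, and the names `cosine` and `ground` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ground(query_vec, regions):
    """Return (box, score) for the region whose embedding best matches the query.

    regions is a list of (box, embedding) pairs, with box = (x1, y1, x2, y2).
    """
    scored = [(box, cosine(query_vec, emb)) for box, emb in regions]
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings (assumed values; a real system would compute these
# with a text encoder and an image/region encoder).
query = [1.0, 0.0, 0.2]
regions = [
    ((10, 10, 50, 80), [0.9, 0.1, 0.1]),
    ((60, 20, 120, 90), [0.0, 1.0, 0.0]),
]

best_box, best_score = ground(query, regions)
print(best_box, round(best_score, 3))
```

Many of the models listed below predict box coordinates directly rather than ranking candidate regions, so this ranking view is just one possible formulation.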

Papers

Showing 351–400 of 571 papers

Title	Date	Tasks	Status	Hype
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding	May 19, 2023	Sentence, Visual Grounding	Unverified	0
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding	May 18, 2023	Contrastive Learning, Object	Unverified	0
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding	May 15, 2023	Diversity, Transfer Learning	Code Available	1
Sample-Specific Debiasing for Better Image-Text Models	Apr 25, 2023	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified	0
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language	Apr 12, 2023	3D visual grounding, Autonomous Driving	Code Available	0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance	Mar 29, 2023	3D visual grounding, Visual Grounding	Code Available	1
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding	Mar 23, 2023	3D visual grounding, Visual Grounding	Code Available	0
Joint Visual Grounding and Tracking with Natural Language Specification	Mar 21, 2023	Visual Grounding, Visual Tracking	Code Available	1
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image Analysis, Phrase Grounding	Unverified	0
Parallel Vertex Diffusion for Unified Visual Grounding	Mar 13, 2023	Visual Grounding	Unverified	0
Focusing On Targets For Improving Weakly Supervised Visual Grounding	Feb 22, 2023	Dependency Parsing, Object	Unverified	0
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	Feb 1, 2023	Action Classification, Image Classification	Code Available	4
Champion Solution for the WSDM2023 Toloka VQA Challenge	Jan 22, 2023	Question Answering, Visual Grounding	Code Available	3
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks	Jan 12, 2023	Cross-Modal Retrieval, Open-Ended Question Answering	Code Available	0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding	Jan 1, 2023	3D visual grounding, Visual Grounding	Unverified	0
CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition	Jan 1, 2023	Sign Language Recognition, Visual Grounding	Unverified	0
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding	Jan 1, 2023	Descriptive, Object	Code Available	1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training	Jan 1, 2023	3D dense captioning, 3D visual grounding	Code Available	1
Dynamic Inference With Grounding Based Vision and Language Models	Jan 1, 2023	Language Modelling, Referring Expression	Unverified	0
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image Generation, Image-text Retrieval	Unverified	0
Position-guided Text Prompt for Vision-Language Pre-training	Dec 19, 2022	Cross-Modal Retrieval, Image Captioning	Code Available	1
Using Multiple Instance Learning to Build Multimodal Representations	Dec 11, 2022	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding	Dec 1, 2022	3D dense captioning, 3D visual grounding	Unverified	0
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding	Nov 28, 2022	object-detection, Object Detection	Code Available	1
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding	Nov 27, 2022	named-entity-recognition, Named Entity Recognition	Unverified	0
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding	Nov 25, 2022	3D visual grounding, Knowledge Distillation	Code Available	1
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	Nov 22, 2022	All, Cross-Modal Retrieval	Code Available	2
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image Generation, Factual Visual Question Answering	Unverified	0
YORO -- Lightweight End to End Visual Grounding	Nov 15, 2022	Natural Language Queries, Visual Grounding	Code Available	1
Visually Grounded VQA by Lattice-based Retrieval	Nov 15, 2022	Information Retrieval, Question Answering	Code Available	0
Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?	Oct 24, 2022	Informativeness, Text Generation	Unverified	0
Instruction-Following Agents with Multimodal Transformer	Oct 24, 2022	Instruction Following, Visual Grounding	Code Available	1
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified	0
A Visual Tour Of Current Challenges In Multimodal Language Models	Oct 22, 2022	Image Generation, Text to Image Generation	Unverified	0
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding	Oct 22, 2022	3D visual grounding, Sentence	Code Available	1
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends	Oct 17, 2022	Few-Shot Learning, Image Captioning	Code Available	3
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language Modeling, Language Modelling	Unverified	0
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding	Oct 10, 2022	Visual Grounding	Unverified	0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Oct 9, 2022	Image-text Retrieval, multimodal interaction	Unverified	0
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach	Oct 3, 2022	Referring Expression, Robot Manipulation	Code Available	0
Cost-Effective Language Driven Image Editing with LX-DRIM	Oct 1, 2022	Visual Grounding	Code Available	0
GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution	Oct 1, 2022	coreference-resolution, Coreference Resolution	Code Available	1
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement	Oct 1, 2022	Graph Neural Network, Object	Unverified	0
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding	Sep 29, 2022	3D visual grounding, Object	Code Available	1
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding	Sep 28, 2022	Decoder, Visual Grounding	Unverified	0
Introspective Learning : A Two-Stage Approach for Inference in Neural Networks	Sep 17, 2022	Active Learning, Decision Making	Code Available	0
Visual Grounding of Inter-lingual Word-Embeddings	Sep 8, 2022	Visual Grounding, Word Embeddings	Unverified	0
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment	Aug 29, 2022	cross-modal alignment, Image-text Retrieval	Code Available	1
VLMAE: Vision-Language Masked Autoencoder	Aug 19, 2022	Image-text Retrieval, Language Modeling	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
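
The tables above report "Accuracy (%)" and "IoU" without defining them. On RefCOCO-style benchmarks, accuracy is commonly computed as the share of queries whose predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The sketch below assumes that convention and axis-aligned boxes in `(x1, y1, x2, y2)` form; the function names and sample boxes are hypothetical, and the leaderboard entries may have been evaluated differently.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Made-up boxes for illustration: one exact match, one poor overlap.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]

acc = accuracy_at_iou(preds, gts)
print(acc)
```

Under this reading, the accuracy rows and the IoU rows measure different quantities (a thresholded hit rate versus a raw overlap score), so they are not directly comparable.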