Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?
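
One simple way to picture the three challenges above is as a toy pipeline: encode the query, encode candidate image regions, and pick the region that best matches the query. The sketch below is only an illustration under strong assumptions. The `Region` class, the `ground` function, and the hand-written feature vectors are hypothetical stand-ins for real text and image encoders, and cosine-similarity ranking is just one possible matching rule, not the method of any paper listed here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A candidate image region (hypothetical): a box plus a feature vector."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    feature: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ground(query_feature: List[float], regions: List[Region]) -> Region:
    """Return the region whose feature is most similar to the query feature."""
    return max(regions, key=lambda r: cosine(query_feature, r.feature))


# Made-up features; a real system would obtain them from learned encoders.
regions = [
    Region(box=(10, 10, 50, 50), feature=[1.0, 0.0, 0.0]),
    Region(box=(60, 20, 120, 90), feature=[0.0, 1.0, 0.2]),
]
query_feature = [0.1, 0.9, 0.1]

best = ground(query_feature, regions)
print(best.box)
```

Here the second region is selected because its feature vector points in nearly the same direction as the query vector.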

Papers


Showing 76–100 of 571 papers

Title	Date	Tasks	Status	Hype
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter	Nov 9, 2023	Object, Visual Grounding	Code Available	1
Learning Cross-modal Context Graph for Visual Grounding	Feb 13, 2020	Graph Matching, Graph Neural Network	Code Available	1
Local-Global Context Aware Transformer for Language-Guided Video Segmentation	Mar 18, 2022	Referring Expression Segmentation, Referring Video Object Segmentation	Code Available	1
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision	Jul 23, 2023	Decoder, Visual Grounding	Code Available	1
Instruction-Following Agents with Multimodal Transformer	Oct 24, 2022	Instruction Following, Visual Grounding	Code Available	1
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding	Nov 28, 2022	object-detection, Object Detection	Code Available	1
Instruction-Guided Visual Masking	May 30, 2024	Instruction Following, Visual Grounding	Code Available	1
Joint Visual Grounding and Tracking with Natural Language Specification	Mar 21, 2023	Visual Grounding, Visual Tracking	Code Available	1
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation	Jul 3, 2020	Contrastive Learning, Knowledge Distillation	Code Available	1
InfMLLM: A Unified Framework for Visual-Language Tasks	Nov 12, 2023	GPU, Image Captioning	Code Available	1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations	Jun 30, 2022	Language Modeling, Language Modelling	Code Available	1
Improving One-stage Visual Grounding by Recursive Sub-query Construction	Aug 3, 2020	Sentence, Sentence Embedding	Code Available	1
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning	Apr 30, 2022	Attribute, Decoder	Code Available	1
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring	Mar 1, 2021	3D visual grounding, Attribute	Code Available	1
Kosmos-2: Grounding Multimodal Large Language Models to the World	Jun 26, 2023	Image Captioning, In-Context Learning	Code Available	1
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding	Nov 25, 2022	3D visual grounding, Knowledge Distillation	Code Available	1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation	Jul 1, 2024	Image-text Retrieval, Question Answering	Code Available	1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling	Nov 23, 2021	Image Captioning, Image Description	Code Available	1
A Unified Framework for 3D Point Cloud Visual Grounding	Aug 23, 2023	CPU, GPU	Code Available	1
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans	May 23, 2023	3D Reconstruction, 3D visual grounding	Code Available	1
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Aug 23, 2024	Language Modeling, Language Modelling	Code Available	1
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation	Apr 5, 2021	Object, Visual Grounding	Code Available	1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding	Aug 2, 2024	Decoder, Reasoning Segmentation	Code Available	1
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models	Sep 24, 2021	Visual Grounding	Code Available	1
A Fast and Accurate One-Stage Approach to Visual Grounding	Aug 18, 2019	Referring Expression, Referring Expression Comprehension	Code Available	1


Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
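
The tables above report Accuracy (%) and IoU without saying how either is computed. A common convention on referring-expression benchmarks such as RefCOCO is to count a predicted box as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. The sketch below assumes that convention. The function names, the `(x1, y1, x2, y2)` box format, and the sample boxes are illustrative assumptions, and the IoU rows in the first table may be computed differently (for example, over segmentation masks).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, targets, threshold=0.5):
    """Percentage of predictions whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)


# Made-up boxes: the first prediction is exact, the second overlaps only partly.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 5, 15, 15)]

print(box_iou(preds[1], targets[1]))       # 25 / 175, roughly 0.143
print(grounding_accuracy(preds, targets))  # 50.0
```

With this thresholded definition, a small shift in a predicted box can flip a sample from correct to incorrect, which is one reason accuracy figures are only comparable across models when the same threshold and split are used.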