Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 176–200 of 571 papers

Title	Date	Tasks	Status	Hype
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data	Oct 28, 2023	3D visual groundingAutonomous Vehicles	CodeCode Available	1
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models	Dec 6, 2023	Autonomous DrivingAutonomous Vehicles	CodeCode Available	1
GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution	Oct 1, 2022	coreference-resolutionCoreference Resolution	CodeCode Available	1
Local-Global Context Aware Transformer for Language-Guided Video Segmentation	Mar 18, 2022	Referring Expression SegmentationReferring Video Object Segmentation	CodeCode Available	1
Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding	Apr 9, 2021	DescriptiveObject	CodeCode Available	1
REX: Reasoning-aware and Grounded Explanation	Mar 11, 2022	Decision MakingExplanation Generation	CodeCode Available	1
SAT: 2D Semantics Assisted Training for 3D Visual Grounding	May 24, 2021	3D visual groundingObject	CodeCode Available	1
Grounded Situation Recognition with Transformers	Nov 19, 2021	DecoderGrounded Situation Recognition	CodeCode Available	1
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models	Sep 24, 2021	Visual Grounding	CodeCode Available	1
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling	Mar 21, 2024	Grounded language learningLanguage Acquisition	CodeCode Available	1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment	Aug 29, 2022	cross-modal alignmentImage-text Retrieval	CodeCode Available	1
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection	Apr 13, 2022	3D visual groundingVisual Grounding	CodeCode Available	1
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding	Sep 29, 2022	3D visual groundingObject	CodeCode Available	1
Guessing State Tracking for Visual Dialogue	Feb 24, 2020	Visual Grounding	CodeCode Available	1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation	Jul 1, 2024	Image-text RetrievalQuestion Answering	CodeCode Available	1
Learning Cross-modal Context Graph for Visual Grounding	Feb 13, 2020	Graph MatchingGraph Neural Network	CodeCode Available	1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations	Jun 30, 2022	Language ModelingLanguage Modelling	CodeCode Available	1
Learning Cross-modal Context Graph for Visual Grounding	Nov 20, 2019	Graph MatchingGraph Neural Network	CodeCode Available	1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition	Feb 15, 2024	Grounded Multimodal Named Entity RecognitionMulti-modal Named Entity Recognition	CodeCode Available	1
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation	Apr 5, 2021	ObjectVisual Grounding	CodeCode Available	1
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding	Oct 22, 2022	3D visual groundingSentence	CodeCode Available	1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding	Aug 2, 2024	DecoderReasoning Segmentation	CodeCode Available	1
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection	Apr 3, 2025	Instruction FollowingLanguage Modeling	CodeCode Available	1
OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding	Mar 13, 2021	Referring ExpressionReferring Expression Segmentation	CodeCode Available	1
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding	Mar 29, 2022	Multimodal ReasoningVisual Grounding	CodeCode Available	1

Show:10 25 50

← PrevPage 8 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified