Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • What is the main focus of the query?
  • How should the image be understood and represented?
  • How can the referred object be localized? (A minimal sketch follows this list.)
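
To make the localization step concrete, here is a minimal sketch of a classic two-stage pipeline: candidate boxes are cropped from the image and scored against the query with CLIP, and the highest-scoring box is returned. This is an illustrative assumption, not the method of any specific paper below; the `ground` helper and the proposal source are placeholders, and exact API details may vary across transformers versions.

    # Two-stage visual grounding sketch: score candidate regions against
    # the query with CLIP and return the best-matching box.
    # NOTE: illustrative only; `boxes` are assumed to come from an
    # off-the-shelf region-proposal detector (not shown here).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def ground(image: Image.Image, query: str, boxes):
        """Return the (x0, y0, x1, y1) box whose crop best matches the query."""
        crops = [image.crop(box) for box in boxes]
        inputs = processor(text=[query], images=crops,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                                attention_mask=inputs["attention_mask"])
        # Cosine similarity between each candidate crop and the query.
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        scores = (img_feats @ txt_feats.T).squeeze(-1)  # shape: (num_boxes,)
        return boxes[scores.argmax().item()]

One-stage methods (several appear in the list below) instead regress the box directly from fused image-text features, avoiding the proposal bottleneck of this two-stage setup.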

Papers

Showing 101–150 of 571 papers

Title | Status | Hype
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Code | 1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code | 1
A Unified Framework for 3D Point Cloud Visual Grounding | Code | 1
Multi-View Transformer for 3D Visual Grounding | Code | 1
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans | Code | 1
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | Code | 1
A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding | Code | 1
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Code | 1
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions | Code | 1
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding | Code | 1
Mask Grounding for Referring Image Segmentation | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
Local-Global Context Aware Transformer for Language-Guided Video Segmentation | Code | 1
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling | Code | 1
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Code | 1
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding | Code | 1
Context Disentangling and Prototype Inheriting for Robust Visual Grounding | Code | 1
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards | Code | 1
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
SAT: 2D Semantics Assisted Training for 3D Visual Grounding | Code | 1
Learning Cross-modal Context Graph for Visual Grounding | Code | 1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1
Connecting What to Say With Where to Look by Modeling Human Attention Traces | Code | 1
Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding | Code | 1
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Code | 1
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding | Code | 1
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision | Code | 1
Joint Visual Grounding and Tracking with Natural Language Specification | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
Instruction-Guided Visual Masking | Code | 1
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Code | 1
Instruction-Following Agents with Multimodal Transformer | Code | 1
Collaborative Transformers for Grounded Situation Recognition | Code | 1
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Code | 1
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | Code | 1
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter | Code | 1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Code | 1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding | Code | 1
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning | Code | 1
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring | Code | 1
Fine-Grained Semantically Aligned Vision-Language Pre-Training | Code | 1
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | – | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | – | Unverified
7 | HYDRA | IoU | 61.7 | – | Unverified
8 | HYDRA | IoU | 61.1 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | – | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | – | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | – | Unverified