Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 501–550 of 571 papers

Title	Date	Tasks	Status
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses	Dec 11, 2024	Image-text RetrievalQuestion Answering	—Unverified
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Nov 7, 2024	DecoderLanguage Modeling	—Unverified
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos	Jan 1, 2025	Large Language ModelVideo Segmentation	—Unverified
A Visual Tour Of Current Challenges In Multimodal Language Models	Oct 22, 2022	Image GenerationText to Image Generation	—Unverified
VidLA: Video-Language Alignment at Scale	Mar 21, 2024	Language ModellingVisual Grounding	—Unverified
Viewpoint-Aware Visual Grounding in 3D Scenes	Jan 1, 2024	3D visual groundingReferring Expression	—Unverified
A Vision Centric Remote Sensing Benchmark	Mar 20, 2025	Question AnsweringRepresentation Learning	—Unverified
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding	Jan 1, 2023	3D visual groundingVisual Grounding	—Unverified
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition	Jul 15, 2025	3D visual groundingVisual Grounding	—Unverified
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding	Jan 2, 2025	3D visual groundingDiagnostic	—Unverified
3D Scene Graph Guided Vision-Language Pre-training	Nov 27, 2024	3D dense captioning3D visual grounding	—Unverified
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding	Oct 10, 2022	Visual Grounding	—Unverified
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring	Jan 16, 2025	3D visual groundingDecoder	—Unverified
VIMI: Grounding Video Generation through Multi-modal Instruction	Jul 8, 2024	Text-to-Video GenerationVideo Generation	—Unverified
Attention-Based Keyword Localisation in Speech using Visual Grounding	Jun 16, 2021	Visual Grounding	—Unverified
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding	May 18, 2023	Contrastive LearningObject	—Unverified
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?	Apr 27, 2025	Visual GroundingVisual Storytelling	—Unverified
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding	Jul 25, 2023	3D visual groundingObject	—Unverified
Zero-Shot Visual Grounding of Referring Utterances in Dialogue	Nov 16, 2021	DescriptiveVisual Grounding	—Unverified
Visual Grounding Annotation of Recipe Flow Graph	May 1, 2020	Visual Grounding	—Unverified
Learning from Synthetic Data for Visual Grounding	Mar 20, 2024	Language ModellingLarge Language Model	—Unverified
Visually Consistent Hierarchical Image Classification	Jun 17, 2024	Classificationimage-classification	—Unverified
Learning Language Structures through Grounding	Jun 14, 2024	Automatic Speech RecognitionDependency Parsing	—Unverified
Visual grounding for desktop graphical user interfaces	May 5, 2024	Language ModelingLanguage Modelling	—Unverified
Learning to Compose and Reason with Language Tree Structures for Visual Grounding	Jun 5, 2019	Visual GroundingVisual Reasoning	—Unverified
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer	Oct 16, 2021	Text GenerationVisual Grounding	—Unverified
Learning to Ground VLMs without Forgetting	Oct 14, 2024	DecoderLanguage Modelling	—Unverified
Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers	Aug 1, 2021	Language ModelingLanguage Modelling	—Unverified
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision	Mar 17, 2018	Visual Grounding	—Unverified
Learning Visual Grounding from Generative Vision and Language Model	Jul 18, 2024	AttributeLanguage Modeling	—Unverified
Learning with Difference Attention for Visually Grounded Self-supervised Representations	Jun 26, 2023	Self-Supervised LearningVisual Grounding	—Unverified
How direct is the link between words and images?	Jun 30, 2022	Visual GroundingWord Embeddings	—Unverified
Less is More: Generating Grounded Navigation Instructions from Landmarks	Nov 25, 2021	DecoderInstruction Following	—Unverified
Visual Grounding of Inter-lingual Word-Embeddings	Sep 8, 2022	Visual GroundingWord Embeddings	—Unverified
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring	Feb 16, 2025	Instance SegmentationLanguage Modeling	—Unverified
Leveraging Past References for Robust Language Grounding	Nov 1, 2019	ObjectReferring Expression	—Unverified
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image GenerationFactual Visual Question Answering	—Unverified
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering	Jan 29, 2024	Language ModelingLanguage Modelling	—Unverified
LanguageRefer: Spatial-Language Model for 3D Visual Grounding	Jul 7, 2021	3D visual groundingLanguage Modeling	—Unverified
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers	Nov 7, 2024	3D visual groundingAutonomous Driving	—Unverified
Lightweight In-Context Tuning for Multimodal Unified Models	Oct 8, 2023	Image CaptioningIn-Context Learning	—Unverified
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language ModelingLanguage Modelling	—Unverified
Language learning using Speech to Image retrieval	Sep 9, 2019	Grounded language learningImage Retrieval	—Unverified
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object DetectionAutonomous Driving	—Unverified
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding	May 27, 2024	Visual Grounding	—Unverified
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms	Jul 1, 2020	DiagnosticObject	—Unverified
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Oct 21, 2024	3D visual groundingObject	—Unverified
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs	Jun 17, 2025	3D visual groundingContrastive Learning	—Unverified
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter	Aug 25, 2021	BlockingObject	—Unverified
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation	Jan 1, 2024	Image SegmentationSemantic Segmentation	—Unverified

Show:10 25 50

← PrevPage 11 of 12Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified