Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 151–175 of 571 papers

Title	Date	Tasks	Status	Hype
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks	Oct 7, 2024	Information RetrievalLanguage Modeling	—Unverified	0
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	Oct 7, 2024	Natural Language Visual GroundingNavigate	CodeCode Available	3
Adaptive Masking Enhances Visual Grounding	Oct 4, 2024	Few-Shot LearningVisual Grounding	CodeCode Available	0
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering	Sep 30, 2024	Optical Character Recognition (OCR)Question Answering	CodeCode Available	0
Individuation in Neural Models with and without Visual Grounding	Sep 27, 2024	Visual Grounding	—Unverified	0
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion	Sep 26, 2024	DescriptiveGeneralized Referring Expression Comprehension	CodeCode Available	2
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue	Sep 26, 2024	Medical Visual Question AnsweringQuestion Answering	—Unverified	0
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models	Sep 16, 2024	AttributeDecoder	CodeCode Available	0
Bayesian Self-Training for Semi-Supervised 3D Segmentation	Sep 12, 2024	3D Instance Segmentation3D Semantic Segmentation	—Unverified	0
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling	Sep 9, 2024	Language ModelingLanguage Modelling	CodeCode Available	0
Visual Grounding with Multi-modal Conditional Adaptation	Sep 8, 2024	object-detectionObject Detection	CodeCode Available	1
Visual Prompting in Multimodal Large Language Models: A Survey	Sep 5, 2024	In-Context LearningPrompt Learning	—Unverified	0
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding	Sep 5, 2024	Question AnsweringScene Understanding	CodeCode Available	2
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar	Aug 30, 2024	Autonomous DrivingVisual Grounding	—Unverified	0
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding	Aug 29, 2024	Data AugmentationImage Generation	CodeCode Available	0
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation	Aug 29, 2024	Instruction FollowingMedical Report Generation	—Unverified	0
MMR: Evaluating Reading Ability of Large Multimodal Models	Aug 26, 2024	Font RecognitionMMR total	—Unverified	0
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Aug 23, 2024	Language ModelingLanguage Modelling	CodeCode Available	1
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models	Aug 15, 2024	Pose EstimationVisual Grounding	—Unverified	0
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation	Aug 9, 2024	Image to textObject	CodeCode Available	2
Task-oriented Sequential Grounding in 3D Scenes	Aug 7, 2024	3D visual groundingVisual Grounding	—Unverified	0
Visual Grounding for Object-Level Generalization in Reinforcement Learning	Aug 4, 2024	Language ModellingObject	CodeCode Available	1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding	Aug 2, 2024	DecoderReasoning Segmentation	CodeCode Available	1
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models	Jul 25, 2024	Computational EfficiencyQuestion Answering	—Unverified	0
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation	Jul 25, 2024	3D visual groundingImage Segmentation	CodeCode Available	2

Show:10 25 50

← PrevPage 7 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified