SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal sketch of the task interface follows the list):

  • How to identify the main focus of the query?
  • How to understand the image?
  • How to locate the target object?
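
As a concrete illustration of that interface (an image and a free-form query in, a bounding box out), here is a minimal sketch that uses an off-the-shelf zero-shot object detector from Hugging Face transformers as a stand-in grounder. The pipeline task, the owlvit-base-patch32 checkpoint, the file path, and the single-query usage are illustrative assumptions, not the method of any paper listed below.

```python
# Minimal sketch of the VG task interface: image + natural language query -> bounding box.
# Uses the "zero-shot-object-detection" pipeline (OWL-ViT) as a stand-in grounder;
# the model checkpoint and image path are illustrative assumptions.
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("example.jpg")          # any RGB image
query = "the dog on the left of the sofa"  # free-form referring expression

# The pipeline scores the query against candidate regions; keep the best-scoring box.
detections = detector(image, candidate_labels=[query])
if detections:
    best = max(detections, key=lambda d: d["score"])
    box = best["box"]  # {"xmin", "ymin", "xmax", "ymax"} in pixel coordinates
    print(f"grounded region: {box} (score={best['score']:.2f})")
else:
    print("no region matched the query")
```

A dedicated grounding model would fuse the full referring expression with image features rather than treat it as a detection label, but the input/output contract is the same.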

Papers

Showing 151–200 of 571 papers

Title | Status | Hype
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Code | 3
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | – | 0
Adaptive Masking Enhances Visual Grounding | Code | 0
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering | Code | 0
Individuation in Neural Models with and without Visual Grounding | – | 0
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | – | 0
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models | Code | 0
Bayesian Self-Training for Semi-Supervised 3D Segmentation | – | 0
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling | Code | 0
Visual Grounding with Multi-modal Conditional Adaptation | Code | 1
Visual Prompting in Multimodal Large Language Models: A Survey | – | 0
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Code | 2
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar | – | 0
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding | Code | 0
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | – | 0
MMR: Evaluating Reading Ability of Large Multimodal Models | – | 0
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Code | 1
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models | – | 0
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code | 2
Task-oriented Sequential Grounding in 3D Scenes | – | 0
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Code | 1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding | Code | 1
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models | – | 0
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation | Code | 2
Unveiling and Mitigating Bias in Audio Visual Segmentation | – | 0
PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding | – | 0
Learning Visual Grounding from Generative Vision and Language Model | – | 0
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models | – | 0
VIMI: Grounding Video Generation through Multi-modal Instruction | – | 0
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model | Code | 0
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | – | 0
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition | Code | 0
Smart Vision-Language Reasoners | Code | 0
ACTRESS: Active Retraining for Semi-supervised Visual Grounding | – | 0
Visual Grounding with Attention-Driven Constraint Balancing | – | 0
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding | Code | 2
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | – | 0
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities | – | 0
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models | – | 0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | – | 0
On the Role of Visual Grounding in VQA | – | 0
Towards Open-World Grasping with Large Vision-Language Models | – | 0
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Code | 5
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Code | 2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Code | 2
Visually Consistent Hierarchical Image Classification | – | 0
Page 4 of 12

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | – | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | – | Unverified
7 | HYDRA | IoU | 61.7 | – | Unverified
8 | HYDRA | IoU | 61.1 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | – | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | – | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | – | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | – | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | – | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | – | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | – | Unverified
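
The tables above report two metrics, Accuracy (%) and IoU. For referring-expression grounding, accuracy is conventionally Acc@0.5: the fraction of queries whose predicted box overlaps the ground-truth box with IoU of at least 0.5. The sketch below computes both under that assumption; the box format and the 0.5 threshold are illustrative, not taken from these leaderboards.

```python
# Hedged sketch of the two metrics in the tables above, assuming axis-aligned boxes
# given as (xmin, ymin, xmax, ymax) and the common Acc@0.5-IoU convention for VG accuracy.
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds: Sequence[Box], gts: Sequence[Box], thresh: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= thresh)
    return 100.0 * hits / len(gts)

# Example: two queries, one localized correctly at the 0.5 threshold.
print(grounding_accuracy([(10, 10, 50, 50), (0, 0, 5, 5)],
                         [(12, 12, 52, 52), (30, 30, 60, 60)]))  # -> 50.0
```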