Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 176–200 of 571 papers

Title	Date	Tasks	Status	Hype
Unveiling and Mitigating Bias in Audio Visual Segmentation	Jul 23, 2024	AttributeVisual Grounding	—Unverified	0
PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding	Jul 19, 2024	3D visual groundingAttribute	—Unverified	0
Learning Visual Grounding from Generative Vision and Language Model	Jul 18, 2024	AttributeLanguage Modeling	—Unverified	0
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models	Jul 18, 2024	3D Semantic SegmentationSemantic Segmentation	—Unverified	0
VIMI: Grounding Video Generation through Multi-modal Instruction	Jul 8, 2024	Text-to-Video GenerationVideo Generation	—Unverified	0
3D Vision and Language Pretraining with Large-Scale Synthetic Data	Jul 8, 2024	Dense CaptioningDiversity	CodeCode Available	1
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model	Jul 7, 2024	SegmentationSentence	CodeCode Available	0
Multi-branch Collaborative Learning Network for 3D Visual Grounding	Jul 7, 2024	3D visual groundingReferring Expression	CodeCode Available	1
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge	Jul 5, 2024	Cross-Modal RetrievalQuestion Answering	—Unverified	0
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition	Jul 5, 2024	Visual GroundingVisual Storytelling	CodeCode Available	0
Smart Vision-Language Reasoners	Jul 5, 2024	MathMathematical Reasoning	CodeCode Available	0
ACTRESS: Active Retraining for Semi-supervised Visual Grounding	Jul 3, 2024	Binary ClassificationVisual Grounding	—Unverified	0
Visual Grounding with Attention-Driven Constraint Balancing	Jul 3, 2024	Objectobject-detection	—Unverified	0
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding	Jul 3, 2024	object-detectionObject Detection	CodeCode Available	2
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA	Jul 2, 2024	Grounded Video Question AnsweringObject Tracking	—Unverified	0
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation	Jul 1, 2024	Image-text RetrievalQuestion Answering	CodeCode Available	1
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities	Jul 1, 2024	3D visual groundingLanguage Modeling	—Unverified	0
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models	Jun 28, 2024	DiversityRetrieval	—Unverified	0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts	Jun 27, 2024	Decision MakingLogical Reasoning	—Unverified	0
On the Role of Visual Grounding in VQA	Jun 26, 2024	Visual GroundingVisual Question Answering (VQA)	—Unverified	0
Towards Open-World Grasping with Large Vision-Language Models	Jun 26, 2024	Robotic GraspingVisual Grounding	—Unverified	0
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs	Jun 24, 2024	Representation LearningVisual Grounding	CodeCode Available	5
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention	Jun 18, 2024	ObjectResponse Generation	CodeCode Available	2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding	Jun 18, 2024	Image CaptioningQuestion Answering	CodeCode Available	2
Visually Consistent Hierarchical Image Classification	Jun 17, 2024	Classificationimage-classification	—Unverified	0

Show:10 25 50

← PrevPage 8 of 23Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified