Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 201–250 of 571 papers

Title	Date	Tasks	Status	Hype
Visually Consistent Hierarchical Image Classification	Jun 17, 2024	Classification, image-classification	Unverified	0
Learning Language Structures through Grounding	Jun 14, 2024	Automatic Speech Recognition, Dependency Parsing	Unverified	0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding	Jun 13, 2024	3D visual grounding, Attribute	Unverified	0
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations	Jun 13, 2024	3D visual grounding, Attribute	Code Available	4
Towards Vision-Language Geo-Foundation Model: A Survey	Jun 13, 2024	Earth Observation, Image Captioning	Code Available	2
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation	Jun 11, 2024	Grounded Multimodal Named Entity Recognition, named-entity-recognition	Code Available	1
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language	Jun 9, 2024	Contrastive Learning, Cross-Modal Retrieval	Code Available	2
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions	Jun 9, 2024	3D visual grounding, Survey	Code Available	3
F-LMM: Grounding Frozen Large Multimodal Models	Jun 9, 2024	General Knowledge, Instruction Following	Code Available	2
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task	Jun 4, 2024	Head Pose Estimation, Language Modelling	Unverified	0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model	Jun 1, 2024	Action Recognition, Activity Recognition	Unverified	0
Instruction-Guided Visual Masking	May 30, 2024	Instruction Following, Visual Grounding	Code Available	1
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection, 3D visual grounding	Unverified	0
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding	May 27, 2024	Visual Grounding	Unverified	0
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding	May 24, 2024	3D visual grounding, Autonomous Driving	Unverified	0
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension	May 21, 2024	3D visual grounding, Referring Expression	Code Available	1
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models	May 16, 2024	Adversarial Attack, Adversarial Robustness	Code Available	0
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding	May 10, 2024	Relation, Spatial Reasoning	Code Available	1
Visual grounding for desktop graphical user interfaces	May 5, 2024	Language Modeling, Language Modelling	Unverified	0
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners	Apr 30, 2024	3D visual grounding, Visual Grounding	Unverified	0
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models	Apr 26, 2024	Game Design, Image Generation	Unverified	0
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs	Apr 25, 2024	Visual Grounding, Visual Question Answering	Code Available	2
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding	Apr 20, 2024	cross-modal alignment, Visual Grounding	Code Available	2
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	Apr 19, 2024	Language Modeling, Language Modelling	Code Available	4
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization	Apr 17, 2024	3D dense captioning, 3D visual grounding	Code Available	0
MedRG: Medical Report Grounding with Multi-modal Large Language Model	Apr 10, 2024	Decoder, Language Modeling	Unverified	0
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	Mar 29, 2024	Hallucination, Image Captioning	Code Available	2
AgentStudio: A Toolkit for Building General Virtual Agents	Mar 26, 2024	Visual Grounding	Code Available	3
Data-Efficient 3D Visual Grounding via Order-Aware Referring	Mar 25, 2024	3D visual grounding, Object	Unverified	0
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery	Mar 22, 2024	Language Modeling, Language Modelling	Unverified	0
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis	Mar 22, 2024	Medical Diagnosis, Medical Visual Question Answering	Code Available	2
VidLA: Video-Language Alignment at Scale	Mar 21, 2024	Language Modelling, Visual Grounding	Unverified	0
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling	Mar 21, 2024	Grounded language learning, Language Acquisition	Code Available	1
Learning from Synthetic Data for Visual Grounding	Mar 20, 2024	Language Modelling, Large Language Model	Unverified	0
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory	Mar 19, 2024	Adversarial Text, Diversity	Code Available	1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning	Mar 19, 2024	Reinforcement Learning (RL), Visual Grounding	Code Available	1
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar	Mar 19, 2024	Autonomous Navigation, Referring Expression	Unverified	0
Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation	Mar 14, 2024	Decision Making, Language Modeling	Unverified	0
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention	Mar 13, 2024	3D visual grounding, cross-modal alignment	Code Available	0
Detecting Concrete Visual Tokens for Multimodal Machine Translation	Mar 5, 2024	Machine Translation, Multimodal Machine Translation	Unverified	0
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding	Mar 5, 2024	3D visual grounding, Decision Making	Code Available	1
Adversarial Testing for Visual Grounding via Image-Aware Property Reduction	Mar 2, 2024	Visual Grounding	Unverified	0
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction	Feb 27, 2024	3D geometry, 3D Object Captioning	Code Available	3
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web	Feb 27, 2024	Language Modeling, Language Modelling	Unverified	0
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding	Feb 23, 2024	Hallucination, Object	Code Available	1
The Revolution of Multimodal Large Language Models: A Survey	Feb 19, 2024	Image Generation, Instruction Following	Code Available	2
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions	Feb 17, 2024	Visual Grounding	Code Available	1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition	Feb 15, 2024	Grounded Multimodal Named Entity Recognition, Multi-modal Named Entity Recognition	Code Available	1
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling	Feb 9, 2024	Hallucination, Natural Language Understanding	Code Available	0
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations	Feb 2, 2024	Contrastive Learning, Object	Unverified	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
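
The tables above report Accuracy (%) and IoU without stating how either is computed. As a rough illustration only, the sketch below shows one common convention for referring-expression benchmarks: a predicted box counts as correct when its intersection-over-union with the ground-truth box reaches a threshold. The 0.5 threshold, the (x1, y1, x2, y2) box format, and the function names `box_iou` and `grounding_accuracy` are assumptions made for this example, not details taken from this page or from any listed paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not stated on this page
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy input the first prediction matches its target exactly and the second overlaps its target only partially, so the two are scored differently under the assumed threshold. Leaderboard entries may use other thresholds or mask-based IoU, so numbers computed this way are not guaranteed to be comparable with the claimed results.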