Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 351–400 of 571 papers

Title	Date	Tasks	Status
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention	May 28, 2024	3D Object Detection3D visual grounding	—Unverified
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding	May 27, 2024	Visual Grounding	—Unverified
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding	May 24, 2024	3D visual groundingAutonomous Driving	—Unverified
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models	May 16, 2024	Adversarial AttackAdversarial Robustness	CodeCode Available
Visual grounding for desktop graphical user interfaces	May 5, 2024	Language ModelingLanguage Modelling	—Unverified
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners	Apr 30, 2024	3D visual groundingVisual Grounding	—Unverified
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models	Apr 26, 2024	Game DesignImage Generation	—Unverified
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization	Apr 17, 2024	3D dense captioning3D visual grounding	CodeCode Available
MedRG: Medical Report Grounding with Multi-modal Large Language Model	Apr 10, 2024	DecoderLanguage Modeling	—Unverified
Data-Efficient 3D Visual Grounding via Order-Aware Referring	Mar 25, 2024	3D visual groundingObject	—Unverified
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery	Mar 22, 2024	Language ModelingLanguage Modelling	—Unverified
VidLA: Video-Language Alignment at Scale	Mar 21, 2024	Language ModellingVisual Grounding	—Unverified
Learning from Synthetic Data for Visual Grounding	Mar 20, 2024	Language ModellingLarge Language Model	—Unverified
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar	Mar 19, 2024	Autonomous NavigationReferring Expression	—Unverified
Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation	Mar 14, 2024	Decision MakingLanguage Modeling	—Unverified
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention	Mar 13, 2024	3D visual groundingcross-modal alignment	CodeCode Available
Detecting Concrete Visual Tokens for Multimodal Machine Translation	Mar 5, 2024	Machine TranslationMultimodal Machine Translation	—Unverified
Adversarial Testing for Visual Grounding via Image-Aware Property Reduction	Mar 2, 2024	Visual Grounding	—Unverified
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web	Feb 27, 2024	Language ModelingLanguage Modelling	—Unverified
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling	Feb 9, 2024	HallucinationNatural Language Understanding	CodeCode Available
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations	Feb 2, 2024	Contrastive LearningObject	—Unverified
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling	Feb 1, 2024	DiversityImage Captioning	—Unverified
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering	Jan 29, 2024	Language ModelingLanguage Modelling	—Unverified
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding	Jan 17, 2024	3D visual groundingScene Understanding	—Unverified
Uncovering the Full Potential of Visual Grounding Methods in VQA	Jan 15, 2024	Question AnsweringVisual Grounding	CodeCode Available
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers	Jan 3, 2024	Question AnsweringVisual Grounding	—Unverified
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation	Jan 1, 2024	Image SegmentationSemantic Segmentation	—Unverified
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach	Jan 1, 2024	Scene UnderstandingVisual Grounding	—Unverified
Viewpoint-Aware Visual Grounding in 3D Scenes	Jan 1, 2024	3D visual groundingReferring Expression	—Unverified
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding	Jan 1, 2024	AttributeRelation	CodeCode Available
Multi-Attribute Interactions Matter for 3D Visual Grounding	Jan 1, 2024	3D visual groundingAttribute	CodeCode Available
Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency	Jan 1, 2024	3D visual groundingRelation	CodeCode Available
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding	Jan 1, 2024	Scene UnderstandingVisual Grounding	—Unverified
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding	Jan 1, 2024	3D visual groundingVisual Grounding	—Unverified
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation	Dec 29, 2023	Visual Grounding	—Unverified
Cycle-Consistency Learning for Captioning and Grounding	Dec 23, 2023	Image CaptioningVisual Grounding	—Unverified
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment	Dec 15, 2023	3D visual groundingNatural Language Queries	—Unverified
Visual Grounding of Whole Radiology Reports for 3D CT Images	Dec 8, 2023	SegmentationVisual Grounding	—Unverified
Improved Visual Grounding through Self-Consistent Explanations	Dec 7, 2023	Language ModellingLarge Language Model	—Unverified
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment	Dec 5, 2023	Explanation GenerationVisual Grounding	CodeCode Available
Uni3DL: Unified Model for 3D and Language Understanding	Dec 5, 2023	Cross-Modal RetrievalInstance Segmentation	—Unverified
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment	Dec 4, 2023	Grounded language learningLanguage Modeling	—Unverified
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training	Dec 3, 2023	object-detectionObject Detection	CodeCode Available
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models	Dec 3, 2023	HallucinationVisual Grounding	CodeCode Available
Context-Aware Indoor Point Cloud Object Generation through User Instructions	Nov 26, 2023	PositionVisual Grounding	—Unverified
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models	Nov 21, 2023	Image SegmentationLanguage Modelling	CodeCode Available
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis	Oct 31, 2023	DescriptiveMedical Image Analysis	—Unverified
GROOViST: A Metric for Grounding Objects in Visual Storytelling	Oct 26, 2023	Visual GroundingVisual Storytelling	CodeCode Available
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network	Oct 25, 2023	Visual Grounding	CodeCode Available
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	BenchmarkingVisual Grounding	CodeCode Available

Show:10 25 50

← PrevPage 8 of 12Next →

All datasets RefCOCO testA RefCOCO+ test B RefCoCo val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified