Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?

Papers

Showing 301–350 of 571 papers

Title	Date	Tasks	Status	Hype
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network	Oct 25, 2023	Visual Grounding	Code Available	0
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding	Oct 22, 2023	Novel Concepts, object-detection	Code Available	1
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes	Oct 20, 2023	Image Captioning, Language Acquisition	Code Available	1
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	Benchmarking, Visual Grounding	Code Available	0
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning	Oct 17, 2023	Segmentation, Visual Grounding	Code Available	0
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V	Oct 17, 2023	Interactive Segmentation, Referring Expression	Code Available	4
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	Oct 14, 2023	Image Classification, Image Description	Code Available	7
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models	Oct 13, 2023	Hallucination, Image Captioning	Code Available	2
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding	Oct 10, 2023	3D visual grounding, Visual Grounding	Code Available	1
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models	Oct 9, 2023	Language Modelling, Question Answering	Code Available	1
Lightweight In-Context Tuning for Multimodal Unified Models	Oct 8, 2023	Image Captioning, In-Context Learning	Unverified	0
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent	Sep 21, 2023	3D visual grounding, Language Modeling	Code Available	2
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection	Sep 18, 2023	3D Object Detection, 3D Open-Vocabulary Object Detection	Unverified	0
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping	Sep 14, 2023	Object, Object Discovery	Code Available	1
Multi3DRefer: Grounding Text Description to Multiple 3D Objects	Sep 11, 2023	3D visual grounding, Contrastive Learning	Code Available	1
Collecting Visually-Grounded Dialogue with A Game Of Sorts	Sep 10, 2023	Coreference Resolution, Image Retrieval	Code Available	0
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding	Sep 8, 2023	3D Instance Segmentation, 3D visual grounding	Unverified	0
Interpretable Visual Question Answering via Reasoning Supervision	Sep 7, 2023	Common Sense Reasoning, Question Answering	Unverified	0
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners	Sep 7, 2023	Diagnostic, Visual Grounding	Code Available	0
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders	Sep 3, 2023	Visual Grounding	Code Available	1
FACET: Fairness in Computer Vision Evaluation Benchmark	Aug 31, 2023	Fairness, image-classification	Unverified	0
WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model	Aug 30, 2023	Language Modeling, Language Modelling	Unverified	0
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory	Aug 28, 2023	Question Answering, Retrieval	Code Available	1
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks	Aug 24, 2023	Language Modeling, Language Modelling	Code Available	0
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	Aug 24, 2023	Chart Question Answering, FS-MEVQA	Code Available	5
A Unified Framework for 3D Point Cloud Visual Grounding	Aug 23, 2023	CPU, GPU	Code Available	1
Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation	Aug 22, 2023	Visual Grounding	Code Available	1
Language-Guided Diffusion Model for Visual Grounding	Aug 18, 2023	cross-modal alignment, Denoising	Code Available	0
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	Aug 8, 2023	3D Question Answering (3D-QA), Dense Captioning	Code Available	2
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding	Jul 25, 2023	3D visual grounding, Object	Unverified	0
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision	Jul 23, 2023	Decoder, Visual Grounding	Code Available	1
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method	Jul 21, 2023	Image-text matching, Text Matching	Code Available	1
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding	Jul 18, 2023	3D visual grounding, Object	Code Available	1
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	Jul 17, 2023	Instruction Following, Sentence	Code Available	2
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation	Jul 12, 2023	Lifelong learning, Object Detection	Code Available	0
OG: Equip vision occupancy with instance segmentation and visual grounding	Jul 12, 2023	Instance Segmentation, Segmentation	Unverified	0
What Do Self-Supervised Speech Models Know About Words?	Jun 30, 2023	Sentence, Sentence Similarity	Code Available	1
Learning with Difference Attention for Visually Grounded Self-supervised Representations	Jun 26, 2023	Self-Supervised Learning, Visual Grounding	Unverified	0
Kosmos-2: Grounding Multimodal Large Language Models to the World	Jun 26, 2023	Image Captioning, In-Context Learning	Code Available	1
Extending CLIP's Image-Text Alignment to Referring Image Segmentation	Jun 14, 2023	Image Segmentation, Referring Expression Segmentation	Unverified	0
Referring to Screen Texts with Voice Assistants	Jun 10, 2023	Navigate, Visual Grounding	Unverified	0
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards	Jun 7, 2023	Diversity, Image Captioning	Code Available	1
Language Adaptive Weight Generation for Multi-task Visual Grounding	Jun 6, 2023	Referring Expression, Referring Expression Comprehension	Code Available	0
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations	Jun 4, 2023	Visual Grounding, Word Embeddings	Code Available	0
Benchmarking Diverse-Modal Entity Linking with Generative Models	May 27, 2023	Benchmarking, Decoder	Unverified	0
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object Detection, Autonomous Driving	Unverified	0
Measuring Faithful and Plausible Visual Grounding in VQA	May 24, 2023	Question Answering, Visual Grounding	Code Available	0
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics	May 24, 2023	Image Captioning, Negation	Code Available	0
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans	May 23, 2023	3D Reconstruction, 3D visual grounding	Code Available	1
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model	May 19, 2023	Language Modeling, Language Modelling	Code Available	1

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
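
The tables above report two metrics, Accuracy (%) and IoU, without defining them. As a rough illustration only, the Python sketch below computes intersection over union (IoU) for two boxes and an accuracy score that counts a prediction as correct when its IoU with the ground-truth box reaches a threshold. The 0.5 threshold, the `(x1, y1, x2, y2)` corner format, and the names `box_iou` and `grounding_accuracy` are assumptions made for this example; the page does not state how the listed numbers were computed.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not stated on the page
) -> float:
    """Percentage of queries whose predicted box has IoU >= threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7, since the boxes overlap in a unit square and their union covers 7 units of area; under the assumed 0.5 threshold that prediction would count as a miss.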