Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How should the image be understood?
How should the referred object be located?

Papers

Showing 401–450 of 571 papers

Title	Date	Tasks	Status
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning	Oct 17, 2023	Segmentation, Visual Grounding	Code Available
Lightweight In-Context Tuning for Multimodal Unified Models	Oct 8, 2023	Image Captioning, In-Context Learning	Unverified
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection	Sep 18, 2023	3D Object Detection, 3D Open-Vocabulary Object Detection	Unverified
Collecting Visually-Grounded Dialogue with A Game Of Sorts	Sep 10, 2023	Coreference Resolution, Image Retrieval	Code Available
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding	Sep 8, 2023	3D Instance Segmentation, 3D visual grounding	Unverified
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners	Sep 7, 2023	Diagnostic, Visual Grounding	Code Available
Interpretable Visual Question Answering via Reasoning Supervision	Sep 7, 2023	Common Sense Reasoning, Question Answering	Unverified
FACET: Fairness in Computer Vision Evaluation Benchmark	Aug 31, 2023	Fairness, image-classification	Unverified
WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model	Aug 30, 2023	Language Modeling, Language Modelling	Unverified
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks	Aug 24, 2023	Language Modeling, Language Modelling	Code Available
Language-Guided Diffusion Model for Visual Grounding	Aug 18, 2023	cross-modal alignment, Denoising	Code Available
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding	Jul 25, 2023	3D visual grounding, Object	Unverified
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation	Jul 12, 2023	Lifelong learning, Object Detection	Code Available
OG: Equip vision occupancy with instance segmentation and visual grounding	Jul 12, 2023	Instance Segmentation, Segmentation	Unverified
Learning with Difference Attention for Visually Grounded Self-supervised Representations	Jun 26, 2023	Self-Supervised Learning, Visual Grounding	Unverified
Extending CLIP's Image-Text Alignment to Referring Image Segmentation	Jun 14, 2023	Image Segmentation, Referring Expression Segmentation	Unverified
Referring to Screen Texts with Voice Assistants	Jun 10, 2023	Navigate, Visual Grounding	Unverified
Language Adaptive Weight Generation for Multi-task Visual Grounding	Jun 6, 2023	Referring Expression, Referring Expression Comprehension	Code Available
Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations	Jun 4, 2023	Visual Grounding, Word Embeddings	Code Available
Benchmarking Diverse-Modal Entity Linking with Generative Models	May 27, 2023	Benchmarking, Decoder	Unverified
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving	May 25, 2023	3D Object Detection, Autonomous Driving	Unverified
Measuring Faithful and Plausible Visual Grounding in VQA	May 24, 2023	Question Answering, Visual Grounding	Code Available
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics	May 24, 2023	Image Captioning, Negation	Code Available
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding	May 19, 2023	Sentence, Visual Grounding	Unverified
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding	May 18, 2023	Contrastive Learning, Object	Unverified
Sample-Specific Debiasing for Better Image-Text Models	Apr 25, 2023	Contrastive Learning, Cross-Modal Retrieval	Unverified
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining	Apr 20, 2023	Visual Grounding	Unverified
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language	Apr 12, 2023	3D visual grounding, Autonomous Driving	Code Available
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding	Mar 23, 2023	3D visual grounding, Visual Grounding	Code Available
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image Analysis, Phrase Grounding	Unverified
Parallel Vertex Diffusion for Unified Visual Grounding	Mar 13, 2023	Visual Grounding	Unverified
Focusing On Targets For Improving Weakly Supervised Visual Grounding	Feb 22, 2023	Dependency Parsing, Object	Unverified
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks	Jan 12, 2023	Cross-Modal Retrieval, Open-Ended Question Answering	Code Available
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding	Jan 1, 2023	3D visual grounding, Visual Grounding	Unverified
CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition	Jan 1, 2023	Sign Language Recognition, Visual Grounding	Unverified
Dynamic Inference With Grounding Based Vision and Language Models	Jan 1, 2023	Language Modelling, Referring Expression	Unverified
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks	Jan 1, 2023	Image Generation, Image-text Retrieval	Unverified
Using Multiple Instance Learning to Build Multimodal Representations	Dec 11, 2022	Contrastive Learning, Cross-Modal Retrieval	Unverified
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding	Dec 1, 2022	3D dense captioning, 3D visual grounding	Unverified
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding	Nov 27, 2022	named-entity-recognition, Named Entity Recognition	Unverified
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image Generation, Factual Visual Question Answering	Unverified
Visually Grounded VQA by Lattice-based Retrieval	Nov 15, 2022	Information Retrieval, Question Answering	Code Available
Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?	Oct 24, 2022	Informativeness, Text Generation	Unverified
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified
A Visual Tour Of Current Challenges In Multimodal Language Models	Oct 22, 2022	Image Generation, Text to Image Generation	Unverified
Like a bilingual baby: The advantage of visually grounding a bilingual language model	Oct 11, 2022	Language Modeling, Language Modelling	Unverified
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding	Oct 10, 2022	Visual Grounding	Unverified
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Oct 9, 2022	Image-text Retrieval, multimodal interaction	Unverified
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach	Oct 3, 2022	Referring Expression, Robot Manipulation	Code Available
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement	Oct 1, 2022	Graph Neural Network, Object	Unverified

Datasets: RefCOCO testA, RefCOCO+ test B, RefCOCO val

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
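
The tables above report two metrics, Accuracy (%) and IoU, without defining them. As a rough illustration only, the sketch below computes intersection-over-union for axis-aligned boxes and an accuracy score that counts a prediction as correct when its IoU with the ground-truth box reaches a threshold. The 0.5 threshold, the (x1, y1, x2, y2) box layout, and the function names `box_iou` and `grounding_accuracy` are assumptions made for this example; the page does not state how any listed model was actually evaluated.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not taken from the page
) -> float:
    """Percentage of predictions whose IoU with the target is >= threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this toy run the first prediction matches its target exactly and the second overlaps its target only slightly, so one of the two predictions counts as a hit under the assumed threshold.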