Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus in a query?
How to understand an image?
How to locate an object?
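
One common way to frame these three questions is a propose-then-rank pipeline: extract the focus of the query, describe candidate regions of the image, and return the region that matches best. The sketch below is a toy illustration under that assumption, not a method from any paper listed here. The `Region` type, the label sets, and the word-overlap score are all hypothetical stand-ins for learned components.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical candidate region: a box (x1, y1, x2, y2) plus descriptive labels."""
    box: tuple
    labels: frozenset


# Small illustrative stopword list; a real system would use a language model instead.
STOPWORDS = frozenset({"the", "a", "an", "on", "in", "of", "to"})


def query_focus(query: str) -> set:
    """Challenge 1: reduce the query to its content words (toy stand-in)."""
    return {w for w in query.lower().split() if w not in STOPWORDS}


def relevance(focus: set, region: Region) -> int:
    """Toy score: number of query words that also appear in the region's labels."""
    return len(focus & region.labels)


def ground(query: str, regions: list) -> Region:
    """Challenge 3: return the candidate region with the highest relevance score."""
    focus = query_focus(query)
    return max(regions, key=lambda r: relevance(focus, r))


# Challenge 2 (image understanding) is assumed done: regions arrive pre-labelled.
regions = [
    Region((10, 20, 110, 220), frozenset({"man", "red", "shirt"})),
    Region((150, 30, 260, 230), frozenset({"woman", "blue", "dress"})),
]
best = ground("the woman in the blue dress", regions)
print(best.box)
```

In practice each step is a learned model (a text encoder, a detector or vision backbone, a cross-modal matching head), but the argmax-over-regions structure is a useful mental model for the task.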

Papers

Showing 251–300 of 571 papers

Title	Date	Tasks	Status	Hype
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling	Feb 1, 2024	Diversity, Image Captioning	Unverified	0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering	Jan 29, 2024	Language Modeling, Language Modelling	Unverified	0
ChatterBox: Multi-round Multimodal Referring and Grounding	Jan 24, 2024	Language Modeling, Language Modelling	Code Available	2
Unifying Visual and Vision-Language Tracking via Contrastive Learning	Jan 20, 2024	Contrastive Learning, Object Tracking	Code Available	1
Veagle: Advancements in Multimodal Representation Learning	Jan 18, 2024	Image Captioning, Language Modelling	Code Available	1
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Jan 18, 2024	Instruction Following, Language Modeling	Code Available	2
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding	Jan 17, 2024	3D visual grounding, Scene Understanding	Unverified	0
Uncovering the Full Potential of Visual Grounding Methods in VQA	Jan 15, 2024	Question Answering, Visual Grounding	Code Available	0
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs	Jan 11, 2024	Representation Learning, Self-Supervised Learning	Code Available	3
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers	Jan 3, 2024	Question Answering, Visual Grounding	Unverified	0
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation	Jan 1, 2024	Descriptive, Object	Code Available	2
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding	Jan 1, 2024	Attribute, Relation	Code Available	0
Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency	Jan 1, 2024	3D visual grounding, Relation	Code Available	0
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation	Jan 1, 2024	Image Segmentation, Semantic Segmentation	Unverified	0
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding	Jan 1, 2024	3D visual grounding, Visual Grounding	Unverified	0
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach	Jan 1, 2024	Scene Understanding, Visual Grounding	Unverified	0
Multi-Attribute Interactions Matter for 3D Visual Grounding	Jan 1, 2024	3D visual grounding, Attribute	Code Available	0
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding	Jan 1, 2024	Scene Understanding, Visual Grounding	Unverified	0
Viewpoint-Aware Visual Grounding in 3D Scenes	Jan 1, 2024	3D visual grounding, Referring Expression	Unverified	0
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs	Jan 1, 2024	Visual Grounding, World Knowledge	Code Available	4
Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation	Dec 29, 2023	Visual Grounding	Unverified	0
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts	Dec 28, 2023	All, Anatomy	Code Available	2
Cycle-Consistency Learning for Captioning and Grounding	Dec 23, 2023	Image Captioning, Visual Grounding	Unverified	0
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection	Dec 22, 2023	Attribute, object-detection	Code Available	1
Mask Grounding for Referring Image Segmentation	Dec 19, 2023	cross-modal alignment, Image Segmentation	Code Available	1
Context Disentangling and Prototype Inheriting for Robust Visual Grounding	Dec 19, 2023	Visual Grounding	Code Available	1
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment	Dec 15, 2023	3D visual grounding, Natural Language Queries	Unverified	0
Mono3DVG: 3D Visual Grounding in Monocular Images	Dec 13, 2023	3D Object Detection, 3D visual grounding	Code Available	1
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation	Dec 13, 2023	Descriptive, Object	Code Available	1
Visual Grounding of Whole Radiology Reports for 3D CT Images	Dec 8, 2023	Segmentation, Visual Grounding	Unverified	0
Improved Visual Grounding through Self-Consistent Explanations	Dec 7, 2023	Language Modelling, Large Language Model	Unverified	0
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models	Dec 6, 2023	Autonomous Driving, Autonomous Vehicles	Code Available	1
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment	Dec 5, 2023	Explanation Generation, Visual Grounding	Code Available	0
Uni3DL: Unified Model for 3D and Language Understanding	Dec 5, 2023	Cross-Modal Retrieval, Instance Segmentation	Unverified	0
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment	Dec 4, 2023	Grounded language learning, Language Modeling	Unverified	0
Aligning and Prompting Everything All at Once for Universal Visual Perception	Dec 4, 2023	All, Object	Code Available	2
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models	Dec 3, 2023	Hallucination, Visual Grounding	Code Available	0
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training	Dec 3, 2023	object-detection, Object Detection	Code Available	0
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions	Nov 28, 2023	Disentanglement, Referring Expression	Code Available	1
Context-Aware Indoor Point Cloud Object Generation through User Instructions	Nov 26, 2023	Position, Visual Grounding	Unverified	0
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding	Nov 26, 2023	3D visual grounding, Object	Code Available	1
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models	Nov 21, 2023	Image Segmentation, Language Modelling	Code Available	0
InfMLLM: A Unified Framework for Visual-Language Tasks	Nov 12, 2023	GPU, Image Captioning	Code Available	1
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	Nov 10, 2023	Diversity, Multi-Task Learning	Code Available	1
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter	Nov 9, 2023	Object, Visual Grounding	Code Available	1
NExT-Chat: An LMM for Chat, Detection and Segmentation	Nov 8, 2023	Referring Expression, Referring Expression Segmentation	Code Available	2
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection	Nov 5, 2023	Anomaly Detection, Question Answering	Code Available	1
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis	Oct 31, 2023	Descriptive, Medical Image Analysis	Unverified	0
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data	Oct 28, 2023	3D visual grounding, Autonomous Vehicles	Code Available	1
GROOViST: A Metric for Grounding Objects in Visual Storytelling	Oct 26, 2023	Visual Grounding, Visual Storytelling	Code Available	0

Datasets: RefCOCO testA, RefCOCO+ testB, RefCOCO val (one table per dataset below, in this order)

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	95.3	—	Unverified
2	mPLUG-2	Accuracy (%)	92.8	—	Unverified
3	X2-VLM (large)	Accuracy (%)	92.1	—	Unverified
4	XFM (base)	Accuracy (%)	90.4	—	Unverified
5	X2-VLM (base)	Accuracy (%)	90.3	—	Unverified
6	X-VLM (base)	Accuracy (%)	89	—	Unverified
7	HYDRA	IoU	61.7	—	Unverified
8	HYDRA	IoU	61.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	92	—	Unverified
2	mPLUG-2	Accuracy (%)	86.05	—	Unverified
3	X2-VLM (large)	Accuracy (%)	81.8	—	Unverified
4	XFM (base)	Accuracy (%)	79.8	—	Unverified
5	X2-VLM (base)	Accuracy (%)	78.4	—	Unverified
6	X-VLM (base)	Accuracy (%)	76.91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Florence-2-large-ft	Accuracy (%)	93.4	—	Unverified
2	mPLUG-2	Accuracy (%)	90.33	—	Unverified
3	X2-VLM (large)	Accuracy (%)	87.6	—	Unverified
4	XFM (base)	Accuracy (%)	86.1	—	Unverified
5	X2-VLM (base)	Accuracy (%)	85.2	—	Unverified
6	X-VLM (base)	Accuracy (%)	84.51	—	Unverified
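
The tables above report Accuracy (%) and IoU. On RefCOCO-style benchmarks, a predicted box is commonly counted as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5. The sketch below assumes that convention and axis-aligned `(x1, y1, x2, y2)` boxes; the function names and sample boxes are illustrative, and individual leaderboard entries may follow a different protocol.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the target exceeds the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) > threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


# Two illustrative predictions: the first overlaps its target well, the second does not.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 8), (5, 5, 15, 15)]
print(accuracy_at(preds, targets))  # prints 50.0
```

Here the first pair has an IoU of 0.8 (a hit) and the second about 0.14 (a miss), so the accuracy is 50%.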