
Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus of the query?
  • How to understand the image?
  • How to locate the queried object?
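
In practice, these three challenges collapse into a single interface: an image and a query go in, a scored bounding box comes out. The sketch below shows that interface using the open-source Grounding DINO checkpoint through Hugging Face transformers; the model ID, thresholds, image URL, and query are illustrative assumptions, not tied to any paper listed below, and argument names can differ across transformers versions.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Illustrative checkpoint choice; any grounding-capable detector would do.
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
# Grounding DINO expects lowercase queries terminated with a period.
query = "a cat lying on the left."

inputs = processor(images=image, text=query, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Map raw logits/boxes back to pixel coordinates; thresholds are assumptions.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)
for box, score in zip(results[0]["boxes"], results[0]["scores"]):
    print([round(v, 1) for v in box.tolist()], round(score.item(), 3))
```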

Papers

Showing 251–275 of 571 papers. Titles with released code are marked [Code].

  • Towards Visual Text Grounding of Multimodal Large Language Model
  • Multimodal Reference Visual Grounding
  • Image Difference Grounding with Natural Language
  • Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities [Code]
  • MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing [Code]
  • ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
  • Efficient Adaptation For Remote Sensing Visual Grounding
  • NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving
  • Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding
  • Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
  • A Vision Centric Remote Sensing Benchmark
  • LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation [Code]
  • Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
  • Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
  • Teaching Metric Distance to Autoregressive Multimodal Foundational Models
  • Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
  • ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
  • Programming with Pixels: Computer-Use Meets Software Engineering
  • GroundCap: A Visually Grounded Image Captioning Dataset
  • Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring
  • TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
  • RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
  • ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
  • FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis
  • AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

Benchmark Results

  #  Model                Metric        Claimed  Verified  Status
  1  Florence-2-large-ft  Accuracy (%)  95.3     –         Unverified
  2  mPLUG-2              Accuracy (%)  92.8     –         Unverified
  3  X2-VLM (large)       Accuracy (%)  92.1     –         Unverified
  4  XFM (base)           Accuracy (%)  90.4     –         Unverified
  5  X2-VLM (base)        Accuracy (%)  90.3     –         Unverified
  6  X-VLM (base)         Accuracy (%)  89       –         Unverified
  7  HYDRA                IoU           61.7     –         Unverified
  8  HYDRA                IoU           61.1     –         Unverified

  #  Model                Metric        Claimed  Verified  Status
  1  Florence-2-large-ft  Accuracy (%)  92       –         Unverified
  2  mPLUG-2              Accuracy (%)  86.05    –         Unverified
  3  X2-VLM (large)       Accuracy (%)  81.8     –         Unverified
  4  XFM (base)           Accuracy (%)  79.8     –         Unverified
  5  X2-VLM (base)        Accuracy (%)  78.4     –         Unverified
  6  X-VLM (base)         Accuracy (%)  76.91    –         Unverified

  #  Model                Metric        Claimed  Verified  Status
  1  Florence-2-large-ft  Accuracy (%)  93.4     –         Unverified
  2  mPLUG-2              Accuracy (%)  90.33    –         Unverified
  3  X2-VLM (large)       Accuracy (%)  87.6     –         Unverified
  4  XFM (base)           Accuracy (%)  86.1     –         Unverified
  5  X2-VLM (base)        Accuracy (%)  85.2     –         Unverified
  6  X-VLM (base)         Accuracy (%)  84.51    –         Unverified
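
For context on the Metric column: the page itself does not define its metrics, but grounding accuracy is conventionally the fraction of queries whose predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5; the 0.5 threshold here is the standard convention, not something stated on this page. A minimal IoU computation:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Under the usual Acc@0.5 convention, this prediction counts as correct:
assert box_iou((10, 10, 50, 50), (12, 8, 48, 55)) >= 0.5
```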