
Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or a multi-round dialogue. VG poses three main challenges (a minimal inference sketch follows the list):

  • What is the main focus of the query?
  • How should the model understand the image?
  • How should it localize the referred object?
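
In practice, a grounding model consumes the image and the query and returns the box (or region) it considers most relevant. The snippet below is a minimal inference sketch using the OWL-ViT checkpoint from Hugging Face transformers as a stand-in open-vocabulary grounding model; the image path and query text are illustrative, and post-processing details may differ across library versions.

    import torch
    from PIL import Image
    from transformers import OwlViTProcessor, OwlViTForObjectDetection

    # Stand-in grounding model; any text-conditioned detector follows the same pattern.
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

    image = Image.open("kitchen.jpg")            # illustrative image
    query = [["the red mug on the left"]]        # natural language query for this image

    inputs = processor(text=query, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Map raw predictions to (x0, y0, x1, y1) boxes in pixel coordinates.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_sizes, threshold=0.0
    )[0]

    best = detections["scores"].argmax()             # highest-scoring box for the query
    print("box:", detections["boxes"][best].tolist(),
          "score:", detections["scores"][best].item())

The three challenges map onto this pipeline: the text encoder distills the query's focus, the image encoder produces region features, and the detection head scores candidate boxes against the query.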

Papers

Showing 376–400 of 571 papers

  • Joint Visual Grounding with Language Scene Graphs
  • Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot
  • Referring to Screen Texts with Voice Assistants
  • FACET: Fairness in Computer Vision Evaluation Benchmark
  • Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
  • Explainable Video Entailment With Grounded Visual Evidence
  • Learning to Assemble Neural Module Tree Networks for Visual Grounding
  • AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training
  • VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
  • ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
  • Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models
  • Revisiting Data Auditing in Large Vision-Language Models
  • Revisiting Visual Grounding
  • AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
  • Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  • Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
  • Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
  • Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation
  • Extending CLIP's Image-Text Alignment to Referring Image Segmentation
  • RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
  • RoViST: Learning Robust Metrics for Visual Storytelling
  • VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
  • VLMAE: Vision-Language Masked Autoencoder
  • RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
  • RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Benchmark Results

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  95.3     -         Unverified
2  mPLUG-2              Accuracy (%)  92.8     -         Unverified
3  X2-VLM (large)       Accuracy (%)  92.1     -         Unverified
4  XFM (base)           Accuracy (%)  90.4     -         Unverified
5  X2-VLM (base)        Accuracy (%)  90.3     -         Unverified
6  X-VLM (base)         Accuracy (%)  89       -         Unverified
7  HYDRA                IoU           61.7     -         Unverified
8  HYDRA                IoU           61.1     -         Unverified

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  92       -         Unverified
2  mPLUG-2              Accuracy (%)  86.05    -         Unverified
3  X2-VLM (large)       Accuracy (%)  81.8     -         Unverified
4  XFM (base)           Accuracy (%)  79.8     -         Unverified
5  X2-VLM (base)        Accuracy (%)  78.4     -         Unverified
6  X-VLM (base)         Accuracy (%)  76.91    -         Unverified

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  93.4     -         Unverified
2  mPLUG-2              Accuracy (%)  90.33    -         Unverified
3  X2-VLM (large)       Accuracy (%)  87.6     -         Unverified
4  XFM (base)           Accuracy (%)  86.1     -         Unverified
5  X2-VLM (base)        Accuracy (%)  85.2     -         Unverified
6  X-VLM (base)         Accuracy (%)  84.51    -         Unverified
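
For reading these numbers: on visual grounding leaderboards, Accuracy (%) conventionally means Acc@0.5, the percentage of queries whose predicted box has intersection-over-union (IoU) of at least 0.5 with the ground-truth box, while IoU rows report the mean IoU itself; the exact protocol depends on the benchmark. Below is a minimal sketch of both computations, with illustrative function names and toy boxes.

    import numpy as np

    def box_iou(a, b):
        """IoU of two axis-aligned boxes in (x0, y0, x1, y1) format."""
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def evaluate(preds, gts, thresh=0.5):
        """Return (Acc@thresh in %, mean IoU) over paired predicted/ground-truth boxes."""
        ious = np.array([box_iou(p, g) for p, g in zip(preds, gts)])
        return 100.0 * float((ious >= thresh).mean()), float(ious.mean())

    # Toy example: one correct localization, one complete miss.
    preds = [(10, 10, 110, 110), (0, 0, 50, 50)]
    gts   = [(12, 8, 108, 112), (60, 60, 120, 120)]
    print(evaluate(preds, gts))   # -> (50.0, ~0.46)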