SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 351–400 of 571 papers

| Title | Status | Hype |
| --- | --- | --- |
| From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | | 0 |
| From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models | | 0 |
| Parallel Vertex Diffusion for Unified Visual Grounding | | 0 |
| Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding | | 0 |
| PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding | | 0 |
| Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning | | 0 |
| Visual Prompting in Multimodal Large Language Models: A Survey | | 0 |
| Context-Aware Indoor Point Cloud Object Generation through User Instructions | | 0 |
| Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models | | 0 |
| Focusing On Targets For Improving Weakly Supervised Visual Grounding | | 0 |
| Programming with Pixels: Computer-Use Meets Software Engineering | | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | | 0 |
| Visual Reference Resolution using Attention Memory for Visual Dialog | | 0 |
| Propagating Over Phrase Relations for One-Stage Visual Grounding | | 0 |
| ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding | | 0 |
| FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | | 0 |
| Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding | | 0 |
| ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning | | 0 |
| FindIt: Generalized Localization with Natural Language Queries | | 0 |
| Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics | | 0 |
| Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder | | 0 |
| Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos | | 0 |
| ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | | 0 |
| Referencing Where to Focus: Improving Visual Grounding with Referential Query | | 0 |
| Few-Shot Visual Grounding for Natural Human-Robot Interaction | | 0 |
| Joint Visual Grounding with Language Scene Graphs | | 0 |
| Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot | | 0 |
| Referring to Screen Texts with Voice Assistants | | 0 |
| FACET: Fairness in Computer Vision Evaluation Benchmark | | 0 |
| Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog | | 0 |
| Explainable Video Entailment With Grounded Visual Evidence | | 0 |
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | | 0 |
| AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training | | 0 |
| VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | | 0 |
| ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | | 0 |
| Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models | | 0 |
| Revisiting Data Auditing in Large Vision-Language Models | | 0 |
| Revisiting Visual Grounding | | 0 |
| AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations | | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | 0 |
| Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment | | 0 |
| Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions | | 0 |
| Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation | | 0 |
| Extending CLIP's Image-Text Alignment to Referring Image Segmentation | | 0 |
| RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception | | 0 |
| RoViST: Learning Robust Metrics for Visual Storytelling | | 0 |
| VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | | 0 |
| VLMAE: Vision-Language Masked Autoencoder | | 0 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | | 0 |
| RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought | | 0 |

Benchmark Results

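| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified |
| 7 | HYDRA | IoU | 61.7 | | Unverified |
| 8 | HYDRA | IoU | 61.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified |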
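
The results above are reported as Accuracy (%) and IoU, but this page does not say how either is computed. As a rough illustration only, the sketch below assumes a convention that is common in visual grounding work: a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. The 0.5 threshold, the `(x_min, y_min, x_max, y_max)` box layout, and the names `box_iou` and `grounding_accuracy` are assumptions made for this example, not details taken from the listed benchmarks.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated by the source page
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `grounding_accuracy([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (1, 1, 3, 3)])` scores the first prediction as a hit (IoU of 1.0) and the second as a miss (IoU of 1/7), giving 50.0. Individual benchmarks may use a different threshold or report mean IoU directly, so the original papers remain the authority on each number.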