SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • identifying the main focus of the query;
  • understanding the image content;
  • localizing the referred object.

Papers

Showing 351–375 of 571 papers

Title | Status | Hype
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention | - | 0
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding | - | 0
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding | - | 0
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | Code | 0
Visual grounding for desktop graphical user interfaces | - | 0
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | - | 0
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | - | 0
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Code | 0
MedRG: Medical Report Grounding with Multi-modal Large Language Model | - | 0
Data-Efficient 3D Visual Grounding via Order-Aware Referring | - | 0
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | - | 0
VidLA: Video-Language Alignment at Scale | - | 0
Learning from Synthetic Data for Visual Grounding | - | 0
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | - | 0
Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation | - | 0
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention | Code | 0
Detecting Concrete Visual Tokens for Multimodal Machine Translation | - | 0
Adversarial Testing for Visual Grounding via Image-Aware Property Reduction | - | 0
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | - | 0
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Code | 0
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations | - | 0
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling | - | 0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | - | 0
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | - | 0
Uncovering the Full Potential of Visual Grounding Methods in VQA | Code | 0
Page 15 of 23

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | - | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 89.0 | - | Unverified
7 | HYDRA | IoU | 61.7 | - | Unverified
8 | HYDRA | IoU | 61.1 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92.0 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | - | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | - | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | - | Unverified
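The tables above report Accuracy (%) and IoU. In visual grounding benchmarks, accuracy is conventionally the fraction of queries whose predicted box overlaps the ground-truth box with IoU ≥ 0.5 (a common convention, not stated on this page). As a minimal sketch, `box_iou` below is an illustrative helper (the name is ours, not from this page) computing IoU for two axis-aligned boxes:

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the intersection rectangle; zero if the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # Union = sum of areas minus the double-counted intersection.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: two 10x10 boxes overlapping in a 5x5 patch -> IoU = 25 / 175.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

Under the IoU ≥ 0.5 convention, this prediction would count as a miss when scoring grounding accuracy.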