SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the target object be located?
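
To make the task concrete, here is a minimal inference sketch using an off-the-shelf open-vocabulary detector (OWL-ViT through the Hugging Face transformers library). The model choice and the pick-the-top-box rule are illustrative assumptions, not the method of any particular paper listed below; any detector that scores regions against text would slot in the same way.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Illustrative model choice (assumption): any text-conditioned detector works.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def ground(image: Image.Image, query: str):
    """Return the pixel box in `image` that best matches the `query` phrase."""
    inputs = processor(text=[[query]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Map raw logits and box offsets to scored boxes in pixel coordinates.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=0.0, target_sizes=target_sizes
    )[0]
    # VG keeps only the single highest-scoring region for the query.
    best = results["scores"].argmax()
    return results["boxes"][best].tolist(), results["scores"][best].item()

# Usage: box, score = ground(Image.open("street.jpg"), "the red car")
```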

Papers

Showing 251–300 of 571 papers (page 6 of 12)

Title | Status | Hype
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe | | 0
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement | | 0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | | 0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | | 0
Data-Efficient 3D Visual Grounding via Order-Aware Referring | | 0
DSM: Building A Diverse Semantic Map for 3D Visual Grounding | | 0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding | | 0
Dynamic Inference With Grounding Based Vision and Language Models | | 0
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding | | 0
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models | | 0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | | 0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | | 0
Efficient Adaptation For Remote Sensing Visual Grounding | | 0
Efficient Multi-Modal Embeddings from Structured Data | | 0
Emergent Communication with World Models | | 0
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities | | 0
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions | | 0
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment | | 0
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | | 0
Learning to Assemble Neural Module Tree Networks for Visual Grounding | | 0
Explainable Video Entailment With Grounded Visual Evidence | | 0
FACET: Fairness in Computer Vision Evaluation Benchmark | | 0
Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot | | 0
Few-Shot Visual Grounding for Natural Human-Robot Interaction | | 0
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos | | 0
FindIt: Generalized Localization with Natural Language Queries | | 0
Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding | | 0
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | | 0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | | 0
Focusing On Targets For Improving Weakly Supervised Visual Grounding | | 0
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models | | 0
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | | 0
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding | | 0
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks | | 0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting | | 0
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | | 0
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | | 0
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | | 0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | | 0
GroundCap: A Visually Grounded Image Captioning Dataset | | 0
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding | | 0
GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance | | 0
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents | | 0
Guiding Visual Question Answering with Attention Priors | | 0
HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | | 0
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task | | 0
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search | | 0
Image Difference Grounding with Natural Language | | 0
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
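
A note on the two metrics in these tables: IoU is the area of overlap between the predicted and ground-truth boxes divided by the area of their union, and Accuracy (%) on VG leaderboards is conventionally Acc@0.5, the percentage of predictions whose IoU reaches 0.5 (an assumption here; the source does not state its threshold). A short sketch of both computations, assuming axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Percentage of predicted boxes whose IoU with ground truth >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# e.g. grounding_accuracy([(10, 10, 50, 50)], [(12, 8, 48, 52)]) -> 100.0
```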