SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 551–571 of 571 papers

Title | Status | Hype
Visual Coreference Resolution in Visual Dialog using Neural Module Networks | Code | 0
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining | | 0
Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search | | 0
Visually grounded cross-lingual keyword spotting in speech | | 0
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction | | 0
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos | | 0
Visual Grounding via Accumulated Attention | | 0
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code | 0
Finding beans in burgers: Deep semantic-visual embedding with localization | Code | 0
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision | | 0
Interactive Reinforcement Learning for Object Grounding via Self-Talking | | 0
Improving Visually Grounded Sentence Representations with Self-Attention | | 0
Self-view Grounding Given a Narrated 360° Video | Code | 0
Visual Reference Resolution using Attention Memory for Visual Dialog | | 0
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures | | 0
Learning Two-Branch Neural Networks for Image-Text Matching Tasks | Code | 0
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation | | 0
Revisiting Visual Question Answering Baselines | Code | 0
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Code | 0
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes | Code | 0
Grounding of Textual Phrases in Images by Reconstruction | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
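
The results above report Accuracy (%) and IoU without saying how either is computed. As a rough illustration only, the sketch below assumes the convention common in visual grounding evaluation: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The function names `box_iou` and `grounding_accuracy`, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 threshold are assumptions made for this example and are not taken from the listed papers or benchmarks.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not stated in the tables
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `grounding_accuracy([(0, 0, 2, 2)], [(0, 0, 2, 2)])` scores a single exact match, while a prediction shifted to `(1, 1, 3, 3)` overlaps the same target with an IoU of 1/7 and would count as a miss at the assumed threshold.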