SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 501-525 of 571 papers

| Title | Status | Hype |
|---|---|---|
| Propagating Over Phrase Relations for One-Stage Visual Grounding | | 0 |
| Spatially Aware Multimodal Transformers for TextVQA | Code | 1 |
| Visual Relation Grounding in Videos | Code | 1 |
| Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder | | 0 |
| Multi-Granularity Modularized Network for Abstract Visual Reasoning | | 0 |
| Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | Code | 1 |
| Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms | | 0 |
| Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot | | 0 |
| Visual Grounding Annotation of Recipe Flow Graph | | 0 |
| Visual Grounding of Learned Physical Models | Code | 1 |
| Deep Multimodal Neural Architecture Search | Code | 1 |
| Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1 |
| Spatio-Temporal Graph for Video Captioning with Knowledge Distillation | | 0 |
| Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | | 0 |
| Visual Grounding in Video for Unsupervised Word Translation | Code | 1 |
| Guessing State Tracking for Visual Dialogue | Code | 1 |
| Emergent Communication with World Models | | 0 |
| Learning Cross-modal Context Graph for Visual Grounding | Code | 1 |
| Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog | | 0 |
| Connecting Vision and Language with Localized Narratives | Code | 0 |
| Compositional Temporal Visual Grounding of Natural Language Event Descriptions | | 0 |
| OptiBox: Breaking the Limits of Proposals for Visual Grounding | | 0 |
| Learning Cross-modal Context Graph for Visual Grounding | Code | 1 |
| Leveraging Past References for Robust Language Grounding | | 0 |
| Countering Language Drift via Visual Grounding | | 0 |

Benchmark Results

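| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified |
| 7 | HYDRA | IoU | 61.7 | | Unverified |
| 8 | HYDRA | IoU | 61.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified |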
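
The tables above report two metrics, Accuracy (%) and IoU, without stating how either is computed. As a rough illustration only, the sketch below shows one common way such numbers are obtained in visual grounding: intersection-over-union between a predicted and a ground-truth box, and accuracy as the share of predictions whose IoU reaches a threshold. The corner box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the function names `iou` and `grounding_accuracy` are assumptions made for this example, not details taken from this page or from any listed model.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold.

    The 0.5 default is an assumed convention, not a value stated by the source.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `grounding_accuracy([(0, 0, 2, 2)], [(0, 0, 2, 2)])` counts the single exact match as a hit, while a prediction that overlaps its target only slightly falls below the threshold and counts as a miss.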