SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 501-550 of 571 papers

Title | Status | Hype
Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Code | 0
Transformers in Vision: A Survey | - | 0
3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds | - | 0
Explainable Video Entailment With Grounded Visual Evidence | - | 0
CASTing Your Model: Learning to Localize Improves Self-Supervised Representations | - | 0
Class-agnostic Object Detection | - | 0
Learning to ground medical text in a 3D human atlas | Code | 0
SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency | Code | 0
Neural Twins Talk | Code | 0
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary | - | 0
Cosine meets Softmax: A tough-to-beat baseline for visual grounding | Code | 0
AttnGrounder: Talking to Cars with Attention | Code | 0
Propagating Over Phrase Relations for One-Stage Visual Grounding | - | 0
Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder | - | 0
Multi-Granularity Modularized Network for Abstract Visual Reasoning | - | 0
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms | - | 0
Fast visual grounding in interaction: bringing few-shot learning with neural networks to an interactive robot | - | 0
Visual Grounding Annotation of Recipe Flow Graph | - | 0
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation | - | 0
Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | - | 0
Emergent Communication with World Models | - | 0
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog | - | 0
Connecting Vision and Language with Localized Narratives | Code | 0
Compositional Temporal Visual Grounding of Natural Language Event Descriptions | - | 0
OptiBox: Breaking the Limits of Proposals for Visual Grounding | - | 0
Leveraging Past References for Robust Language Grounding | - | 0
Countering Language Drift via Visual Grounding | - | 0
Language learning using Speech to Image retrieval | - | 0
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe | - | 0
Multimodal Unified Attention Networks for Vision-and-Language Interactions | - | 0
Differentiable Disentanglement Filter: an Application Agnostic Core Concept Discovery Probe | - | 0
Transfer Learning from Audio-Visual Grounding to Speech Recognition | - | 0
Joint Visual Grounding with Language Scene Graphs | - | 0
Visually Grounded Neural Syntax Acquisition | - | 0
Learning to Compose and Reason with Language Tree Structures for Visual Grounding | - | 0
On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval | - | 0
Semantic query-by-example speech search using visual grounding | Code | 0
Modularized Textual Grounding for Counterfactual Resilience | Code | 0
VQD: Visual Query Detection in Natural Scenes | - | 0
Revisiting Visual Grounding | - | 0
Learning semantic sentence representations from visually grounded language without lexical knowledge | Code | 0
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment | - | 0
Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Code | 0
You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding | Code | 0
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded | - | 0
Learning to Assemble Neural Module Tree Networks for Visual Grounding | - | 0
Multi-task Learning of Hierarchical Vision-Language Representation | - | 0
Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG | - | 0
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization | - | 0
Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | - | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | - | Unverified
7 | HYDRA | IoU | 61.7 | - | Unverified
8 | HYDRA | IoU | 61.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | - | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | - | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | - | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | - | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | - | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | - | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | - | Unverified
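
The leaderboards report two metrics, Accuracy (%) and IoU, without stating how they are computed. The sketch below shows one common way such numbers are obtained for box-level grounding. It assumes boxes in (x1, y1, x2, y2) corner format and counts a prediction as correct when its IoU with the ground-truth box is at least 0.5. Both the box format and the 0.5 threshold are assumptions, not taken from this page, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold.

    The 0.5 default is an assumed convention, not a value stated on this page.
    """
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, calling `grounding_accuracy(predicted_boxes, ground_truth_boxes)` on paired lists of boxes yields a percentage comparable in form to the Accuracy (%) column, while averaging `box_iou` over the pairs gives an IoU-style score. Whether the listed models were evaluated this way is not confirmed here.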