SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 201210 of 571 papers

TitleStatusHype
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMsCode1
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape GameCode1
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue GenerationCode1
Multi-task Visual Grounding with Coarse-to-Fine Consistency ConstraintsCode1
Dual Attention Networks for Visual Reference Resolution in Visual DialogCode0
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document ImagesCode0
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual GroundingCode0
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language ModelsCode0
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question AnsweringCode0
Beyond Human Perception: Understanding Multi-Object World from Monocular ViewCode0
Show:102550
← PrevPage 21 of 58Next →

Benchmark Results

#ModelMetricClaimedVerifiedStatus
1Florence-2-large-ftAccuracy (%)95.3Unverified
2mPLUG-2Accuracy (%)92.8Unverified
3X2-VLM (large)Accuracy (%)92.1Unverified
4XFM (base)Accuracy (%)90.4Unverified
5X2-VLM (base)Accuracy (%)90.3Unverified
6X-VLM (base)Accuracy (%)89Unverified
7HYDRAIoU61.7Unverified
8HYDRAIoU61.1Unverified
#ModelMetricClaimedVerifiedStatus
1Florence-2-large-ftAccuracy (%)92Unverified
2mPLUG-2Accuracy (%)86.05Unverified
3X2-VLM (large)Accuracy (%)81.8Unverified
4XFM (base)Accuracy (%)79.8Unverified
5X2-VLM (base)Accuracy (%)78.4Unverified
6X-VLM (base)Accuracy (%)76.91Unverified
#ModelMetricClaimedVerifiedStatus
1Florence-2-large-ftAccuracy (%)93.4Unverified
2mPLUG-2Accuracy (%)90.33Unverified
3X2-VLM (large)Accuracy (%)87.6Unverified
4XFM (base)Accuracy (%)86.1Unverified
5X2-VLM (base)Accuracy (%)85.2Unverified
6X-VLM (base)Accuracy (%)84.51Unverified