Visual Grounding
Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:
- What is the main focus in a query?
- How to understand an image?
- How to locate an object?
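The benchmark tables on this page report Accuracy (%) and IoU without defining them. As a minimal sketch, assuming the common convention that a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5, the two metrics could be computed as below. The threshold, the `(x_min, y_min, x_max, y_max)` box format, and the function names `box_iou` and `grounding_accuracy` are assumptions for illustration, not taken from any listed paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of queries whose predicted box reaches the IoU threshold.

    The 0.5 default is an assumed convention, not stated in the source.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


# One exact match and one weak overlap (IoU = 1/7): one hit out of two.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(grounding_accuracy(preds, targets))  # 50.0
```

Under this sketch, an accuracy entry is the share of queries that clear the threshold, while an IoU entry would be an average of `box_iou` values, so the two kinds of rows are not directly comparable.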
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | — | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | — | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | — | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | — | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | — | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | — | Unverified |
| 7 | HYDRA | IoU | 61.7 | — | Unverified |
| 8 | HYDRA | IoU | 61.1 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | — | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | — | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | — | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | — | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | — | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | — | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | — | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | — | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | — | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | — | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | — | Unverified |