Image-to-Text Retrieval
Image-text retrieval is the task of retrieving relevant images for a textual description, or corresponding textual descriptions for a given image. It is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual data is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented comparably, so that their similarity can be measured directly, enabling more accurate retrieval.
Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
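Retrieval in a shared embedding space reduces to nearest-neighbor search under cosine similarity. A minimal sketch, assuming embeddings have already been produced by some encoder (the toy vectors below are random placeholders, not real model outputs):

```python
import numpy as np

def retrieve(query_emb, candidate_embs, k=3):
    """Return the indices and scores of the top-k candidates by cosine similarity."""
    # Normalize to unit length so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy shared space: rows stand in for text embeddings of candidate captions.
rng = np.random.default_rng(0)
texts = rng.normal(size=(5, 8))
# A hypothetical image embedding that lies close to caption 2 in the space.
image = texts[2] + 0.05 * rng.normal(size=8)

idx, scores = retrieve(image, texts, k=1)
print(idx)  # index of the best-matching caption
```

In practice the candidate embeddings are precomputed and indexed, so each query costs one matrix-vector product plus a top-k selection.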
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Oscar | Recall@10 | 99.8 | — | Unverified |
| 2 | Oscar | Recall@10 | 98.3 | — | Unverified |
| 3 | Unicoder-VL | Recall@10 | 97.2 | — | Unverified |
| 4 | BLIP-2 (ViT-G, fine-tuned) | Recall@1 | 85.4 | — | Unverified |
| 5 | ONE-PEACE (ViT-G, w/o ranking) | Recall@1 | 84.1 | — | Unverified |
| 6 | BLIP-2 (ViT-L, fine-tuned) | Recall@1 | 83.5 | — | Unverified |
| 7 | DVSA | Recall@10 | 74.8 | — | Unverified |
| 8 | IAIS | Recall@1 | 67.78 | — | Unverified |
| 9 | CLIP (zero-shot) | Recall@1 | 58.4 | — | Unverified |
| 10 | FLAVA (ViT-B, zero-shot) | Recall@1 | 42.74 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVL-G-FT (finetuned, w/o ranking) | Recall@1 | 97.9 | — | Unverified |
| 2 | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | — | Unverified |
| 3 | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | — | Unverified |
| 4 | InternVL-C-FT (finetuned, w/o ranking) | Recall@1 | 97.2 | — | Unverified |
| 5 | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | — | Unverified |
| 6 | ERNIE-ViL 2.0 | Recall@1 | 96.1 | — | Unverified |
| 7 | ALBEF | Recall@1 | 95.9 | — | Unverified |
| 8 | UNITER | Recall@1 | 87.3 | — | Unverified |
| 9 | GSMN | Recall@1 | 76.4 | — | Unverified |
| 10 | LGSGM | Recall@1 | 71 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BLIP2 FlanT5-XXL (Text-only FT) | Specificity | 94 | — | Unverified |
| 2 | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity | 84 | — | Unverified |
| 3 | BLIP2 FlanT5-XL (Fine-tuned) | Specificity | 81 | — | Unverified |
| 4 | BLIP Large | Specificity | 77 | — | Unverified |
| 5 | CoCa ViT-L-14 MSCOCO | Specificity | 72 | — | Unverified |
| 6 | BLIP2 FlanT5-XXL (Zero-shot) | Specificity | 71 | — | Unverified |
| 7 | CLIP ViT-L/14 | Specificity | 70 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ERNIE-ViL 2.0 | Recall@1 | 33.7 | — | Unverified |
| 2 | CMCL | Recall@1 | 20.3 | — | Unverified |
| 3 | ERNIE-ViL 2.0 | Recall@1 | 19 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | FETA's CLIP-MIL (Many-Shot Image-to-text) | Recall@1 | 35.5 | — | Unverified |
| 2 | FETA's CLIP-MIL (Many-Shot Image-to-text) | Recall@1 | 29 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SigLIP (ViT-L, zero-shot) | Recall@1 | 70.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GeoRSCLIP-FT | Recall@1 (image-to-text) | 22.14 | — | Unverified |
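The tables above report Recall@K: the fraction of queries for which a correct match appears among the top K retrieved candidates. A minimal sketch of the computation (the ranked lists and ground-truth ids below are illustrative, not taken from any benchmark):

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose correct item appears in the top-k results."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)

# Three queries: ranked candidate ids per query, and the correct id for each.
ranked = [[4, 1, 9], [2, 7, 5], [8, 3, 0]]
truth = [1, 5, 6]
print(recall_at_k(ranked, truth, 1))  # 0.0 — no query ranks its match first
print(recall_at_k(ranked, truth, 3))  # 2/3 — two of three matches are in the top 3
```

Note that Recall@1 is the strictest variant, which is why the Recall@10 numbers in the first table sit well above the Recall@1 numbers for comparable models.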