Image-to-Text Retrieval
Image-text retrieval is the task of retrieving relevant images from a textual description or, conversely, finding the textual descriptions that correspond to a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge is bridging the semantic gap: the difference between how information is represented visually in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented in a comparable way, so that their similarity can be measured directly, enabling more accurate retrieval.
Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
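The shared-embedding idea described above can be illustrated with a minimal sketch. It assumes image and caption embeddings have already been produced by some encoder (not shown here); the function names `cosine_similarity` and `rank_captions` and the toy vectors are hypothetical, and cosine similarity is only one common choice of similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices ordered from most to least similar to the image."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings in a 3-dimensional shared space.
# In practice these would come from trained image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.0, 0.0],  # caption 1 (closest to the image)
    [0.0, 0.0, 1.0],  # caption 2
]

ranking = rank_captions(image_embedding, caption_embeddings)
print(ranking)  # [1, 0, 2]
```

Retrieval in the other direction (text to image) works the same way under these assumptions, with a caption embedding as the query and image embeddings as the candidates.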
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ERNIE-ViL2.0 | Recall@1 | 33.7 | — | Unverified |
| 2 | CMCL | Recall@1 | 20.3 | — | Unverified |
| 3 | ERNIE-ViL2.0 | Recall@1 | 19 | — | Unverified |
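For readers unfamiliar with the Recall@1 metric reported above, the following is a rough sketch of how Recall@K is commonly computed: a query counts as a hit if at least one correct item appears among its top K ranked candidates. The function name `recall_at_k` and the toy rankings are hypothetical, and reporting the result as a percentage is an assumption based on common convention; the table does not state the datasets or the exact evaluation protocol behind its numbers.

```python
def recall_at_k(rankings, ground_truth, k=1):
    """Percentage of queries with at least one correct item in the top k.

    rankings: one list of candidate indices per query, best match first.
    ground_truth: one set of correct candidate indices per query.
    """
    hits = sum(
        1
        for ranked, correct in zip(rankings, ground_truth)
        if any(candidate in correct for candidate in ranked[:k])
    )
    return 100.0 * hits / len(rankings)


# Hypothetical rankings for four image queries over three candidate captions.
rankings = [
    [1, 0, 2],
    [0, 2, 1],
    [2, 1, 0],
    [1, 2, 0],
]
ground_truth = [{1}, {2}, {2}, {0}]

print(recall_at_k(rankings, ground_truth, k=1))  # 50.0
print(recall_at_k(rankings, ground_truth, k=2))  # 75.0
```

Using a set of correct indices per query allows for datasets in which one image has several valid captions.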