# Cross-Modal Retrieval
Cross-Modal Retrieval (CMR) is the task of retrieving relevant items across different modalities, such as image, text, video, and audio — for example, finding the images that best match a textual query. Its core challenge is the heterogeneity gap: data from different modalities have distinct feature representations, so they cannot be compared directly. To address this, most CMR methods learn a shared latent embedding space into which all modalities are projected, so that similarity between items can be measured with a common distance metric such as cosine similarity.
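As an illustrative sketch of this idea, the snippet below projects hypothetical pre-extracted image and text features into a shared space and ranks items by cosine similarity. The feature dimensions and the linear projections are assumptions for illustration; real systems learn the projections with a contrastive objective rather than using random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features: 5 images (2048-d) and 5 captions (768-d).
image_feats = rng.normal(size=(5, 2048))
text_feats = rng.normal(size=(5, 768))

# Projections into a shared 256-d embedding space (random here; learned in practice).
W_img = rng.normal(size=(2048, 256))
W_txt = rng.normal(size=(768, 256))

def embed(feats, W):
    """Project into the shared space and L2-normalize, so the dot product
    between two embeddings equals their cosine similarity."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_emb = embed(image_feats, W_img)
txt_emb = embed(text_feats, W_txt)

# Text-to-image retrieval: rank all images for each caption by similarity.
sim = txt_emb @ img_emb.T            # (num_texts, num_images)
ranking = np.argsort(-sim, axis=1)   # best match first
print(ranking[:, 0])                 # top-1 image index per caption
```

Because both modalities are normalized in the same space, image-to-text retrieval is just the transpose of the same similarity matrix.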
See also: *Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study*.
The task has 522 associated papers, with benchmarks covering COCO 2014, Flickr30k, Recipe1M+, RSICD, RSITMD, ChEBI-20, MSCOCO-1k, SoundingEarth, CUHK-PEDES, Flickr-8k, MSCOCO, and MS-COCO-2014.
## Benchmark Results

Each table below lists the top claimed results for a single benchmark drawn from the datasets above. All figures are as reported in the original papers; none has yet been independently reproduced, hence the empty Verified column.
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MaMMUT (ours) | Image-to-text R@1 | 70.7 | — | Unverified |
| 2 | VAST | Text-to-image R@1 | 68.0 | — | Unverified |
| 3 | X2-VLM (large) | Text-to-image R@1 | 67.7 | — | Unverified |
| 4 | BEiT-3 | Text-to-image R@1 | 67.2 | — | Unverified |
| 5 | XFM (base) | Text-to-image R@1 | 67.0 | — | Unverified |
| 6 | X2-VLM (base) | Text-to-image R@1 | 66.2 | — | Unverified |
| 7 | PTP-BLIP (14M) | Text-to-image R@1 | 64.9 | — | Unverified |
| 8 | OmniVL (14M) | Text-to-image R@1 | 64.8 | — | Unverified |
| 9 | VSE-Gradient | Text-to-image R@1 | 63.6 | — | Unverified |
| 10 | X-VLM (base) | Text-to-image R@1 | 63.4 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | X2-VLM (large) | Image-to-text R@1 | 98.8 | — | Unverified |
| 2 | X2-VLM (base) | Image-to-text R@1 | 98.5 | — | Unverified |
| 3 | BEiT-3 | Image-to-text R@1 | 98.0 | — | Unverified |
| 4 | OmniVL (14M) | Image-to-text R@1 | 97.3 | — | Unverified |
| 5 | Aurora (ours, r=128) | Image-to-text R@1 | 97.2 | — | Unverified |
| 6 | ERNIE-ViL 2.0 | Image-to-text R@1 | 97.2 | — | Unverified |
| 7 | X-VLM (base) | Image-to-text R@1 | 97.1 | — | Unverified |
| 8 | VSE-Gradient | Image-to-text R@1 | 97.0 | — | Unverified |
| 9 | ALIGN | Image-to-text R@1 | 95.3 | — | Unverified |
| 10 | VAST | Text-to-image R@1 | 91.0 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VLPCook (R1M+) | Image-to-text R@1 | 74.9 | — | Unverified |
| 2 | VLPCook | Image-to-text R@1 | 73.6 | — | Unverified |
| 3 | T-Food (CLIP) | Image-to-text R@1 | 72.3 | — | Unverified |
| 4 | T-Food | Image-to-text R@1 | 68.2 | — | Unverified |
| 5 | X-MRS | Image-to-text R@1 | 64.0 | — | Unverified |
| 6 | H-T | Image-to-text R@1 | 60.0 | — | Unverified |
| 7 | SCAN | Image-to-text R@1 | 54.0 | — | Unverified |
| 8 | ACME | Image-to-text R@1 | 51.8 | — | Unverified |
| 9 | VLPCook | Image-to-text R@1 | 45.2 | — | Unverified |
| 10 | AdaMine | Image-to-text R@1 | 39.8 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HarMA (w/ GeoRSCLIP) | Mean Recall | 38.95 | — | Unverified |
| 2 | GeoRSCLIP-FT | Mean Recall | 38.87 | — | Unverified |
| 3 | GLISA | Mean Recall | 37.69 | — | Unverified |
| 4 | RemoteCLIP | Mean Recall | 36.35 | — | Unverified |
| 5 | PE-RSITR (MRS-Adapter) | Mean Recall | 31.12 | — | Unverified |
| 6 | PIR | Mean Recall | 24.46 | — | Unverified |
| 7 | DOVE | Mean Recall | 22.72 | — | Unverified |
| 8 | SWAN | Mean Recall | 20.61 | — | Unverified |
| 9 | GaLR | Mean Recall | 18.96 | — | Unverified |
| 10 | AMFMN | Mean Recall | 15.53 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HarMA (w/ GeoRSCLIP) | Image-to-text R@1 | 32.74 | — | Unverified |
| 2 | GeoRSCLIP-FT | Image-to-text R@1 | 32.3 | — | Unverified |
| 3 | GLISA | Image-to-text R@1 | 32.08 | — | Unverified |
| 4 | RemoteCLIP | Image-to-text R@1 | 28.76 | — | Unverified |
| 5 | PE-RSITR (MRS-Adapter) | Image-to-text R@1 | 23.67 | — | Unverified |
| 6 | PIR | Image-to-text R@1 | 18.14 | — | Unverified |
| 7 | DOVE | Image-to-text R@1 | 16.81 | — | Unverified |
| 8 | GaLR | Image-to-text R@1 | 14.82 | — | Unverified |
| 9 | SWAN | Image-to-text R@1 | 13.35 | — | Unverified |
| 10 | AMFMN | Image-to-text R@1 | 10.63 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CLASS (ORMA) | Hits@1 | 67.4 | — | Unverified |
| 2 | ORMA | Hits@1 | 66.5 | — | Unverified |
| 3 | Song et al. | Hits@1 | 56.5 | — | Unverified |
| 4 | CLASS (AMAN) | Hits@1 | 51.1 | — | Unverified |
| 5 | DSOKR | Hits@1 | 51.0 | — | Unverified |
| 6 | AMAN | Hits@1 | 49.4 | — | Unverified |
| 7 | All-Ensemble | Hits@1 | 34.4 | — | Unverified |
| 8 | MLP1 | Hits@1 | 22.4 | — | Unverified |
| 9 | GCN2 | Hits@1 | 22.3 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Image-to-text R@1 | 81.9 | — | Unverified |
| 2 | Dual-path CNN | Image-to-text R@1 | 41.2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Dual Path | Text-to-image MedR | 2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Image-to-text R@1 | 56.2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | 3SHNet | Image-to-text R@1 | 85.8 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Text-to-image R@1 | 43.0 | — | Unverified |
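The tables above rank systems by Recall@K (R@1), Mean Recall (commonly the average of R@1/R@5/R@10 in both retrieval directions), median rank (MedR), or Hits@1 (equivalent to R@1). A minimal sketch of computing Recall@K and MedR from a similarity matrix, under the assumption that query `i`'s ground-truth gallery item has index `i` (the usual layout of paired retrieval test sets):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 (in %) and median rank from a
    (queries x gallery) similarity matrix, assuming query i's
    ground-truth item is gallery item i."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # gallery indices, best match first
    # 1-based rank at which each query's ground-truth item appears.
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    recall_at = {k: float(np.mean(ranks <= k) * 100) for k in (1, 5, 10)}
    return recall_at, float(np.median(ranks))

# Identity similarity: every query retrieves its ground truth first.
recall_at, medr = retrieval_metrics(np.eye(12))
print(recall_at[1], medr)  # 100.0 1.0
```

Mean Recall for a benchmark would then be the average of the six recall values from running this once on the image-to-text similarity matrix and once on its transpose.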