SOTAVerified

Image-to-Text Retrieval

Image-text retrieval is the task of retrieving relevant images given a textual description, or of finding the textual description that corresponds to a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge is bridging the semantic gap: the difference between how visual information is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented comparably, so that their similarity can be measured directly and retrieval reduces to ranking candidates by similarity.
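The shared-embedding approach can be sketched as follows. This is a minimal toy illustration with hand-crafted vectors; a real system would produce the embeddings with trained image and text encoders (e.g. a CLIP-style dual encoder).

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so that dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Three images and their three captions embedded in the same toy 3-dim
# space (hypothetical values); pair i is the ground-truth match.
image_emb = normalize(np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
]))
text_emb = normalize(np.array([
    [0.9, 0.2, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.1, 1.0],
]))

# Similarity matrix: rows = images, columns = captions.
sims = image_emb @ text_emb.T

# Image-to-text retrieval: for each image, rank all captions by similarity.
ranked = np.argsort(-sims, axis=1)
print(ranked[:, 0].tolist())  # top-1 caption index per image -> [0, 1, 2]
```

Because both modalities live in the same space, text-to-image retrieval is the same computation with the roles of rows and columns swapped.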

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce

Papers

Showing 1–10 of 59 papers

| Title | Status | Hype |
| --- | --- | --- |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | | 0 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1 |
| Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | | 0 |
| SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | | 0 |
| DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | | 0 |
| ABC: Achieving Better Control of Multimodal Embeddings using VLMs | | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0 |
| Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | | 0 |
| Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Oscar | Recall@10 | 99.8 | | Unverified |
| 2 | Oscar | Recall@10 | 98.3 | | Unverified |
| 3 | Unicoder-VL | Recall@10 | 97.2 | | Unverified |
| 4 | BLIP-2 (ViT-G, fine-tuned) | Recall@1 | 85.4 | | Unverified |
| 5 | ONE-PEACE (ViT-G, w/o ranking) | Recall@1 | 84.1 | | Unverified |
| 6 | BLIP-2 (ViT-L, fine-tuned) | Recall@1 | 83.5 | | Unverified |
| 7 | DVSA | Recall@10 | 74.8 | | Unverified |
| 8 | IAIS | Recall@1 | 67.78 | | Unverified |
| 9 | CLIP (zero-shot) | Recall@1 | 58.4 | | Unverified |
| 10 | FLAVA (ViT-B, zero-shot) | Recall@1 | 42.74 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | InternVL-G-FT (finetuned, w/o ranking) | Recall@1 | 97.9 | | Unverified |
| 2 | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | | Unverified |
| 3 | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | | Unverified |
| 4 | InternVL-C-FT (finetuned, w/o ranking) | Recall@1 | 97.2 | | Unverified |
| 5 | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | | Unverified |
| 6 | ERNIE-ViL 2.0 | Recall@1 | 96.1 | | Unverified |
| 7 | ALBEF | Recall@1 | 95.9 | | Unverified |
| 8 | UNITER | Recall@1 | 87.3 | | Unverified |
| 9 | GSMN | Recall@1 | 76.4 | | Unverified |
| 10 | LGSGM | Recall@1 | 71 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | BLIP2 FlanT5-XXL (Text-only FT) | Specificity | 94 | | Unverified |
| 2 | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity | 84 | | Unverified |
| 3 | BLIP2 FlanT5-XL (Fine-tuned) | Specificity | 81 | | Unverified |
| 4 | BLIP Large | Specificity | 77 | | Unverified |
| 5 | CoCa ViT-L-14 MSCOCO | Specificity | 72 | | Unverified |
| 6 | BLIP2 FlanT5-XXL (Zero-shot) | Specificity | 71 | | Unverified |
| 7 | CLIP ViT-L/14 | Specificity | 70 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | ERNIE-ViL2.0 | Recall@1 | 33.7 | | Unverified |
| 2 | CMCL | Recall@1 | 20.3 | | Unverified |
| 3 | ERNIE-ViL2.0 | Recall@1 | 19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 35.5 | | Unverified |
| 2 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 29 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | CMCL | Recall@1 | 36.1 | | Unverified |
| 2 | CMCL | Recall@1 | 36 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SigLIP (ViT-L, zero-shot) | Recall@1 | 70.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GeoRSCLIP-FT | Image to Text Recall@1 | 22.14 | | Unverified |
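Most of the results above report Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch of how it is computed, using a hypothetical similarity matrix where query i's true match has index i:

```python
import numpy as np

def recall_at_k(sims, k):
    """Percentage of queries whose ground-truth item (index i for
    query i) appears among the top-k items ranked by similarity."""
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [i in top_k[i] for i in range(sims.shape[0])]
    return 100.0 * sum(hits) / len(hits)

# Toy 4x4 image-to-text similarity matrix (illustrative values only).
sims = np.array([
    [0.9, 0.2, 0.1, 0.0],   # true caption ranked 1st
    [0.8, 0.3, 0.1, 0.0],   # true caption ranked 2nd
    [0.1, 0.2, 0.9, 0.0],   # true caption ranked 1st
    [0.5, 0.4, 0.3, 0.2],   # true caption ranked 4th
])
print(recall_at_k(sims, 1))   # 50.0
print(recall_at_k(sims, 2))   # 75.0
```

Recall@10 is the same computation with k=10, which is why it saturates near 100 on small galleries while Recall@1 stays far lower.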