SOTAVerified

Image-to-Text Retrieval

Image-text retrieval is the process of retrieving relevant images based on textual descriptions, or finding corresponding textual descriptions for a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual information is represented in images and how humans describe it in language. To address this, many methods learn a shared embedding space in which both images and text are represented comparably, so that their similarity can be measured directly, enabling more accurate retrieval.

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
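The shared-embedding retrieval scheme described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the random vectors stand in for the outputs of real image and text encoders (e.g. CLIP's vision and text towers), and the `Recall@1` computed at the end is the same metric reported in the benchmark tables below.

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Mock embeddings standing in for encoder outputs; shape (n_items, embed_dim).
# Each text embedding is placed near its paired image embedding to simulate
# a trained alignment (this pairing is an assumption for the sketch).
rng = np.random.default_rng(0)
image_embeds = normalize(rng.normal(size=(4, 8)))
text_embeds = normalize(image_embeds + 0.1 * rng.normal(size=(4, 8)))

# Similarity matrix: entry (i, j) scores image i against text j.
sim = image_embeds @ text_embeds.T

# Image-to-text retrieval: for each image, take the highest-scoring text.
retrieved = sim.argmax(axis=1)

# Recall@1: fraction of images whose top-ranked text is the paired one.
recall_at_1 = float((retrieved == np.arange(4)).mean())
```

Recall@K generalizes this by checking whether the paired text appears among the K highest-scoring candidates rather than only at rank 1.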

Papers

Showing 41–50 of 59 papers

| Title | Status | Hype |
|---|---|---|
| Design of the topology for contrastive visual-textual alignment | Code | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | | 0 |
| Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2 |
| Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset | | 0 |
| COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 0 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Code | 1 |
| FLAVA: A Foundational Language And Vision Alignment Model | Code | 1 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1 |
| OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation | Code | 0 |
| A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Code | 1 |
Page 5 of 6

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Oscar | Recall@10 | 99.8 | | Unverified |
| 2 | Oscar | Recall@10 | 98.3 | | Unverified |
| 3 | Unicoder-VL | Recall@10 | 97.2 | | Unverified |
| 4 | BLIP-2 (ViT-G, fine-tuned) | Recall@1 | 85.4 | | Unverified |
| 5 | ONE-PEACE (ViT-G, w/o ranking) | Recall@1 | 84.1 | | Unverified |
| 6 | BLIP-2 (ViT-L, fine-tuned) | Recall@1 | 83.5 | | Unverified |
| 7 | DVSA | Recall@10 | 74.8 | | Unverified |
| 8 | IAIS | Recall@1 | 67.78 | | Unverified |
| 9 | CLIP (zero-shot) | Recall@1 | 58.4 | | Unverified |
| 10 | FLAVA (ViT-B, zero-shot) | Recall@1 | 42.74 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVL-G-FT (finetuned, w/o ranking) | Recall@1 | 97.9 | | Unverified |
| 2 | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | | Unverified |
| 3 | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | | Unverified |
| 4 | InternVL-C-FT (finetuned, w/o ranking) | Recall@1 | 97.2 | | Unverified |
| 5 | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | | Unverified |
| 6 | ERNIE-ViL 2.0 | Recall@1 | 96.1 | | Unverified |
| 7 | ALBEF | Recall@1 | 95.9 | | Unverified |
| 8 | UNITER | Recall@1 | 87.3 | | Unverified |
| 9 | GSMN | Recall@1 | 76.4 | | Unverified |
| 10 | LGSGM | Recall@1 | 71 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BLIP2 FlanT5-XXL (Text-only FT) | Specificity | 94 | | Unverified |
| 2 | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity | 84 | | Unverified |
| 3 | BLIP2 FlanT5-XL (Fine-tuned) | Specificity | 81 | | Unverified |
| 4 | BLIP Large | Specificity | 77 | | Unverified |
| 5 | CoCa ViT-L-14 MSCOCO | Specificity | 72 | | Unverified |
| 6 | BLIP2 FlanT5-XXL (Zero-shot) | Specificity | 71 | | Unverified |
| 7 | CLIP ViT-L/14 | Specificity | 70 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ERNIE-ViL2.0 | Recall@1 | 33.7 | | Unverified |
| 2 | CMCL | Recall@1 | 20.3 | | Unverified |
| 3 | ERNIE-ViL2.0 | Recall@1 | 19 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 35.5 | | Unverified |
| 2 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 29 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CMCL | Recall@1 | 36.1 | | Unverified |
| 2 | CMCL | Recall@1 | 36 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SigLIP (ViT-L, zero-shot) | Recall@1 | 70.6 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GeoRSCLIP-FT | Image to Text Recall@1 | 22.14 | | Unverified |