SOTAVerified

Image-to-Text Retrieval

Image-text retrieval is the task of retrieving relevant images given a textual description, or of finding the corresponding textual description for a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual information is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented comparably, so that their similarity can be measured directly and retrieval becomes more accurate.
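The shared-embedding-space idea above can be sketched in a few lines: once an image encoder and a text encoder map their inputs into the same vector space (as CLIP-style models do), retrieval reduces to a nearest-neighbor search under cosine similarity. In this minimal sketch the embeddings are random placeholders standing in for real encoder outputs.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query.

    Assumes query and gallery embeddings live in the same shared space
    (e.g. produced by a CLIP-style text encoder and image encoder)."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    # Indices of the k highest similarities, best first
    return np.argsort(-sims)[:k]

# Toy example: random vectors stand in for encoder outputs
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)           # embedding of one caption
image_embs = rng.normal(size=(100, 512))  # embeddings of 100 images
top5 = retrieve(text_emb, image_embs, k=5)
print(top5)
```

The same function covers both retrieval directions: swap which modality supplies the query and which supplies the gallery.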

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce

Papers

Showing 1–25 of 59 papers

Title | Status | Hype
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Code | 4
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Code | 4
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
Sigmoid Loss for Language Image Pre-Training | Code | 3
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment | Code | 2
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Code | 2
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Code | 2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Code | 2
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1
Negative Pre-aware for Noisy Cross-modal Matching | Code | 1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Code | 1
Vision-Language Dataset Distillation | Code | 1
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Code | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
Rethinking Benchmarks for Cross-modal Image-text Retrieval | Code | 1
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Code | 1
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval | Code | 1
FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Code | 1
FLAVA: A Foundational Language And Vision Alignment Model | Code | 1
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Oscar | Recall@10 | 99.8 | – | Unverified
2 | Oscar | Recall@10 | 98.3 | – | Unverified
3 | Unicoder-VL | Recall@10 | 97.2 | – | Unverified
4 | BLIP-2 (ViT-G, fine-tuned) | Recall@1 | 85.4 | – | Unverified
5 | ONE-PEACE (ViT-G, w/o ranking) | Recall@1 | 84.1 | – | Unverified
6 | BLIP-2 (ViT-L, fine-tuned) | Recall@1 | 83.5 | – | Unverified
7 | DVSA | Recall@10 | 74.8 | – | Unverified
8 | IAIS | Recall@1 | 67.78 | – | Unverified
9 | CLIP (zero-shot) | Recall@1 | 58.4 | – | Unverified
10 | FLAVA (ViT-B, zero-shot) | Recall@1 | 42.74 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | InternVL-G-FT (finetuned, w/o ranking) | Recall@1 | 97.9 | – | Unverified
2 | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | – | Unverified
3 | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | – | Unverified
4 | InternVL-C-FT (finetuned, w/o ranking) | Recall@1 | 97.2 | – | Unverified
5 | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | – | Unverified
6 | ERNIE-ViL 2.0 | Recall@1 | 96.1 | – | Unverified
7 | ALBEF | Recall@1 | 95.9 | – | Unverified
8 | UNITER | Recall@1 | 87.3 | – | Unverified
9 | GSMN | Recall@1 | 76.4 | – | Unverified
10 | LGSGM | Recall@1 | 71 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BLIP2 FlanT5-XXL (Text-only FT) | Specificity | 94 | – | Unverified
2 | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity | 84 | – | Unverified
3 | BLIP2 FlanT5-XL (Fine-tuned) | Specificity | 81 | – | Unverified
4 | BLIP Large | Specificity | 77 | – | Unverified
5 | CoCa ViT-L-14 MSCOCO | Specificity | 72 | – | Unverified
6 | BLIP2 FlanT5-XXL (Zero-shot) | Specificity | 71 | – | Unverified
7 | CLIP ViT-L/14 | Specificity | 70 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ERNIE-ViL2.0 | Recall@1 | 33.7 | – | Unverified
2 | CMCL | Recall@1 | 20.3 | – | Unverified
3 | ERNIE-ViL2.0 | Recall@1 | 19 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 35.5 | – | Unverified
2 | FETA's CLIP-MIL (Many-Shot Image-to-text) | R@1 | 29 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CMCL | Recall@1 | 36.1 | – | Unverified
2 | CMCL | Recall@1 | 36 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SigLIP (ViT-L, zero-shot) | Recall@1 | 70.6 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GeoRSCLIP-FT | Image to Text Recall@1 | 22.14 | – | Unverified
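The Recall@K metric reported throughout these tables is the fraction of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch, assuming one matching caption per image (benchmarks such as COCO and Flickr30K actually pair each image with several captions, in which case any of them counts as a hit):

```python
import numpy as np

def recall_at_k(sim_matrix, k):
    """Recall@K for image-to-text retrieval.

    sim_matrix[i, j] is the similarity between image i and caption j;
    the ground-truth caption for image i is assumed to be caption i."""
    # Rank captions for each image by descending similarity
    ranks = np.argsort(-sim_matrix, axis=1)
    # A query scores a hit if its ground-truth index is in the top k
    targets = np.arange(len(sim_matrix))[:, None]
    hits = (ranks[:, :k] == targets).any(axis=1)
    return hits.mean() * 100  # reported as a percentage

# Identity similarity matrix: every image ranks its own caption first
print(recall_at_k(np.eye(4), k=1))  # → 100.0
```

By construction Recall@K is non-decreasing in K, which is why the Recall@10 scores above sit well above the Recall@1 scores for the same models.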