Image-to-Text Retrieval

Image-text retrieval is the task of retrieving relevant images from a textual description or, conversely, finding the textual descriptions that correspond to a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual information is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented in a comparable way, so that their similarity can be measured directly and retrieval becomes more accurate.

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
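
As a rough illustration of the shared-embedding idea described above, the sketch below ranks candidate captions for one image by cosine similarity. It assumes the image and the captions have already been encoded into vectors of the same dimension. The function names and the toy 3-dimensional vectors are hypothetical and do not come from any of the listed models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_captions(image_embedding, caption_embeddings):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings with made-up values (assumption: not produced by a real encoder).
image = [0.9, 0.1, 0.0]
captions = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.2, 0.0],  # caption 1
    [0.5, 0.5, 0.5],  # caption 2
]

ranking = rank_captions(image, captions)
print(ranking)  # caption indices, best match first
```

Text-to-image retrieval works the same way with the roles swapped: the query is a caption embedding and the candidates are image embeddings.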

Papers


Showing 1–50 of 59 papers

Title	Date	Tasks	Status	Hype
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities	Nov 12, 2022	Contrastive Learning; Cross-Modal Retrieval	Code Available	4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	Jan 30, 2023	Generative Visual Question Answering; Image Captioning	Code Available	4
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities	May 18, 2023	1 Image, 2*2 Stitchi; Action Classification	Code Available	3
Sigmoid Loss for Language Image Pre-Training	Mar 27, 2023	Contrastive Learning; Disentanglement	Code Available	3
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment	Jan 1, 2024	cross-modal alignment; Cross-Modal Retrieval	Code Available	2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs	Jun 9, 2022	Image Captioning; Image Classification	Code Available	2
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing	Jun 20, 2023	Cross-Modal Retrieval; Image Retrieval	Code Available	2
Learning Transferable Visual Models From Natural Language Supervision	Feb 26, 2021	Action Recognition; Benchmarking	Code Available	2
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks	Apr 13, 2020	Cross-Modal Retrieval; Image Captioning	Code Available	2
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment	Apr 28, 2024	Cross-Modal Retrieval; Image Retrieval	Code Available	2
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers	May 27, 2023	Image Captioning; Image Retrieval	Code Available	1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval	Jun 4, 2021	Graph Matching; Image Retrieval	Code Available	1
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval	Dec 6, 2022	Cross-Modal Retrieval; Image-text matching	Code Available	1
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	Jul 16, 2021	Cross-Modal Retrieval; Grounded language learning	Code Available	1
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models	Jun 10, 2025	Contrastive Learning; Image-text matching	Code Available	1
FETA: Towards Specializing Foundation Models for Expert Task Applications	Sep 8, 2022	Domain Generalization; Few-Shot Learning	Code Available	1
FLAVA: A Foundational Language And Vision Alignment Model	Dec 8, 2021	Image Retrieval; Image-to-Text Retrieval	Code Available	1
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages	Jan 27, 2022	Cross-Modal Retrieval; Few-Shot Learning	Code Available	1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	Dec 21, 2023	Image Retrieval; Image-to-Text Retrieval	Code Available	1
Learning Relation Alignment for Calibrated Cross-modal Retrieval	May 28, 2021	Cross-Modal Retrieval; Image-text Retrieval	Code Available	1
Vision-Language Dataset Distillation	Aug 15, 2023	Dataset Distillation; image-classification	Code Available	1
Negative Pre-aware for Noisy Cross-modal Matching	Dec 10, 2023	Cross-modal retrieval with noisy correspondence; Image-text matching	Code Available	1
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports	Jul 24, 2023	Contrastive Learning; Image to text	Code Available	1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	Sep 29, 2023	Cross-Modal Retrieval; Image-text matching	Code Available	1
Rethinking Benchmarks for Cross-modal Image-text Retrieval	Apr 21, 2023	Cross-Modal Retrieval; Image-text Retrieval	Code Available	1
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers	Jan 31, 2023	Image Captioning; Image Classification	Code Available	1
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training	Mar 11, 2021	Contrastive Learning; GPU	Code Available	1
Accept the Modality Gap: An Exploration in the Hyperbolic Space	Jan 1, 2024	Image to text; Image-to-Text Retrieval	Unverified	0
Is Cross-modal Information Retrieval Possible without Training?	Apr 20, 2023	Contrastive Learning; Cross-Modal Information Retrieval	Unverified	0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding	Dec 2, 2024	Caption Generation; Domain Generalization	Unverified	0
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization	Sep 26, 2024	Image to text; Image-to-Text Retrieval	Unverified	0
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	Apr 15, 2022	Contrastive Learning; Cross-Modal Retrieval	Unverified	0
Hierarchical Gumbel Attention Network for Text-based Person Search	Oct 10, 2020	Image Retrieval; Image to text	Unverified	0
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?	Mar 7, 2024	Image to text; Image-to-Text Retrieval	Unverified	0
Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization	Oct 30, 2024	Image to text; Image-to-Text Retrieval	Unverified	0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	Mar 13, 2023	Common Sense Reasoning; Explanation Generation	Unverified	0
Towards a Visual-Language Foundation Model for Computational Pathology	Jul 24, 2023	Contrastive Learning; image-classification	Unverified	0
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning	May 26, 2024	Image to text; Image-to-Text Retrieval	Unverified	0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs	Apr 17, 2025	Cross-Modal Retrieval; Image Retrieval	Unverified	0
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration	Jun 12, 2025	cross-modal alignment; Image to text	Unverified	0
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image Generation; Factual Visual Question Answering	Unverified	0
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval	Jul 29, 2022	Cross-Modal Retrieval; Data Augmentation	Unverified	0
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training	Aug 16, 2019	Image-text matching; Image-text Retrieval	Unverified	0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs	Mar 1, 2025	Image to text; Image-to-Text Retrieval	Unverified	0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation	Jan 1, 2025	image-classification; Image Classification	Unverified	0
When are Lemons Purple? The Concept Association Bias of Vision-Language Models	Dec 22, 2022	Attribute; image-classification	Unverified	0
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution	May 16, 2025	Cross-Modal Retrieval; Image to text	Unverified	0
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset	May 25, 2022	Image Captioning; Image Retrieval	Unverified	0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation	Apr 16, 2025	Contrastive Learning; Image to text	Unverified	0
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models	Jul 30, 2024	Image to text; Image-to-Text Retrieval	Code Available	0


Datasets: COCO (Common Objects in Context), Flickr30k, WHOOPS!, AIC-ICC, FETA Car-Manuals, RUC-CAS-WenLan, COCO, RSICD

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Oscar	Recall@10	99.8	—	Unverified
2	Oscar	Recall@10	98.3	—	Unverified
3	Unicoder-VL	Recall@10	97.2	—	Unverified
4	BLIP-2 (ViT-G, fine-tuned)	Recall@1	85.4	—	Unverified
5	ONE-PEACE (ViT-G, w/o ranking)	Recall@1	84.1	—	Unverified
6	BLIP-2 (ViT-L, fine-tuned)	Recall@1	83.5	—	Unverified
7	DVSA	Recall@10	74.8	—	Unverified
8	IAIS	Recall@1	67.78	—	Unverified
9	CLIP (zero-shot)	Recall@1	58.4	—	Unverified
10	FLAVA (ViT-B, zero-shot)	Recall@1	42.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	InternVL-G-FT (finetuned, w/o ranking)	Recall@1	97.9	—	Unverified
2	ONE-PEACE (finetuned, w/o ranking)	Recall@1	97.6	—	Unverified
3	BLIP-2 ViT-G (zero-shot, 1K test set)	Recall@1	97.6	—	Unverified
4	InternVL-C-FT (finetuned, w/o ranking)	Recall@1	97.2	—	Unverified
5	BLIP-2 ViT-L (zero-shot, 1K test set)	Recall@1	96.9	—	Unverified
6	ERNIE-ViL 2.0	Recall@1	96.1	—	Unverified
7	ALBEF	Recall@1	95.9	—	Unverified
8	UNITER	Recall@1	87.3	—	Unverified
9	GSMN	Recall@1	76.4	—	Unverified
10	LGSGM	Recall@1	71	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BLIP2 FlanT5-XXL (Text-only FT)	Specificity	94	—	Unverified
2	BLIP2 FlanT5-XXL (Fine-tuned)	Specificity	84	—	Unverified
3	BLIP2 FlanT5-XL (Fine-tuned)	Specificity	81	—	Unverified
4	BLIP Large	Specificity	77	—	Unverified
5	CoCa ViT-L-14 MSCOCO	Specificity	72	—	Unverified
6	BLIP2 FlanT5-XXL (Zero-shot)	Specificity	71	—	Unverified
7	CLIP ViT-L/14	Specificity	70	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	ERNIE-ViL2.0	Recall@1	33.7	—	Unverified
2	CMCL	Recall@1	20.3	—	Unverified
3	ERNIE-ViL2.0	Recall@1	19	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	FETA's CLIP-MIL (Many-Shot Image-to-text)	R@1	35.5	—	Unverified
2	FETA's CLIP-MIL (Many-Shot Image-to-text)	R@1	29	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	CMCL	Recall@1	36.1	—	Unverified
2	CMCL	Recall@1	36	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SigLIP (ViT-L, zero-shot)	Recall@1	70.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GeoRSCLIP-FT	Image to Text Recall@1	22.14	—	Unverified
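
Most of the tables above report Recall@K. As a minimal sketch, assuming the common definition (the percentage of queries for which at least one ground-truth item appears among the top K retrieved candidates), the metric can be computed as follows. The function name and the toy rankings are hypothetical, and the exact evaluation protocol behind each claimed number (test split, number of captions per image) is not stated on this page and may differ between papers.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate ids per query, best first.
    relevant: one set of ground-truth candidate ids per query.
    """
    hits = sum(
        1
        for ranked, truth in zip(rankings, relevant)
        if any(item in truth for item in ranked[:k])
    )
    return 100.0 * hits / len(rankings)

# Hypothetical toy data: 4 image queries, each ranked over 5 caption ids.
rankings = [
    [3, 0, 1, 2, 4],
    [1, 2, 0, 3, 4],
    [4, 3, 2, 1, 0],
    [0, 2, 4, 1, 3],
]
relevant = [{0}, {1}, {2}, {3}]

r1 = recall_at_k(rankings, relevant, 1)
r3 = recall_at_k(rankings, relevant, 3)
print(r1, r3)  # 25.0 75.0
```

Because a query counts as a hit when any one of its ground-truth items is retrieved, Recall@10 is never lower than Recall@1 for the same ranking, which is worth keeping in mind when comparing rows that use different values of K.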