Image-to-Text Retrieval

Image-text retrieval is the task of retrieving relevant images from a textual description, or of finding the matching textual descriptions for a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge lies in bridging the semantic gap: the difference between how visual information is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented in a comparable way, so that their similarity can be measured directly and retrieval becomes more accurate.

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
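
As a rough illustration of the shared-embedding idea described above, the sketch below ranks candidate captions for one image by cosine similarity. It assumes the image and the texts have already been encoded into vectors of the same dimension by some model; the function names and the toy vectors are hypothetical and are not taken from any of the listed papers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_texts_for_image(image_emb, text_embs, top_k=3):
    """Return indices of the top_k texts, most similar first."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy embeddings (assumed values, for illustration only).
image_emb = [1.0, 0.0, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [2.0, 0.1, 0.0],  # close match (vector length does not matter)
    [1.0, 1.0, 0.0],  # partial match
]
ranked = rank_texts_for_image(image_emb, text_embs, top_k=2)
print(ranked)
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking. Text-to-image retrieval works the same way with the roles of query and candidates swapped.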

Papers

Showing 1–10 of 59 papers

Title	Date	Tasks	Status	Hype
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration	Jun 12, 2025	cross-modal alignment, Image to text	Unverified	0
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models	Jun 10, 2025	Contrastive Learning, Image-text matching	Code Available	1
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution	May 16, 2025	Cross-Modal Retrieval, Image to text	Unverified	0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs	Apr 17, 2025	Cross-Modal Retrieval, Image Retrieval	Unverified	0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation	Apr 16, 2025	Contrastive Learning, Image to text	Unverified	0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs	Mar 1, 2025	Image to text, Image-to-Text Retrieval	Unverified	0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation	Jan 1, 2025	image-classification, Image Classification	Unverified	0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding	Dec 2, 2024	Caption Generation, Domain Generalization	Unverified	0
Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization	Oct 30, 2024	Image to text, Image-to-Text Retrieval	Unverified	0
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization	Sep 26, 2024	Image to text, Image-to-Text Retrieval	Unverified	0

Datasets: COCO (Common Objects in Context), Flickr30k, WHOOPS!, AIC-ICC, FETA Car-Manuals, RUC-CAS-WenLan, COCO, RSICD

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	InternVL-G-FT (finetuned, w/o ranking)	Recall@1	97.9	—	Unverified
2	BLIP-2 ViT-G (zero-shot, 1K test set)	Recall@1	97.6	—	Unverified
3	ONE-PEACE (finetuned, w/o ranking)	Recall@1	97.6	—	Unverified
4	InternVL-C-FT (finetuned, w/o ranking)	Recall@1	97.2	—	Unverified
5	BLIP-2 ViT-L (zero-shot, 1K test set)	Recall@1	96.9	—	Unverified
6	ERNIE-ViL 2.0	Recall@1	96.1	—	Unverified
7	ALBEF	Recall@1	95.9	—	Unverified
8	UNITER	Recall@1	87.3	—	Unverified
9	GSMN	Recall@1	76.4	—	Unverified
10	LGSGM	Recall@1	71	—	Unverified
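
The table reports Recall@1 without stating the evaluation protocol. Under the common definition for image-to-text retrieval, a query image counts as a hit when at least one of its ground-truth captions appears among the top K retrieved texts, and Recall@K is the percentage of queries that are hits. The sketch below assumes that definition; the function name and the toy rankings are hypothetical and do not reproduce any result above.

```python
def recall_at_k(ranked_lists, relevant_sets, k=1):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_lists[i]  : candidate ids for query i, best match first
    relevant_sets[i] : set of ground-truth ids for query i
    """
    if not ranked_lists:
        raise ValueError("need at least one query")
    if len(ranked_lists) != len(relevant_sets):
        raise ValueError("one relevant set is required per query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(candidate in relevant for candidate in ranked[:k])
    )
    return 100.0 * hits / len(ranked_lists)


# Toy example: four query images, caption ids ranked by a hypothetical model.
ranked_lists = [[0, 5, 2], [3, 1, 4], [7, 2, 6], [9, 8, 0]]
relevant_sets = [{0, 1}, {1}, {2}, {5}]

r1 = recall_at_k(ranked_lists, relevant_sets, k=1)
r3 = recall_at_k(ranked_lists, relevant_sets, k=3)
print(r1, r3)
```

In this toy example only the first query has a ground-truth caption ranked first, so Recall@1 is 25.0, while three of the four queries have one within the top three, giving a Recall@3 of 75.0.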