Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is a task of retrieving items across different modalities, such as image, text, video, and audio. The core challenge of CMR is the heterogeneity gap, which arises because data from different modalities have distinct representations, making direct comparison difficult. To address this, most CMR methods focus on learning a shared latent embedding space. In this space, concepts from different modalities are projected, allowing their similarity to be measured using a distance metric.

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 51–75 of 522 papers

Title	Date	Tasks	Status	Hype	Score
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages	Jan 27, 2022	Cross-Modal RetrievalFew-Shot Learning	CodeCode Available	1	5
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping	Oct 29, 2023	Contrastive LearningCross-Modal Retrieval	CodeCode Available	1	5
Adaptive label-aware graph convolutional networks for cross-modal retrieval	Aug 6, 2021	Cross-Modal RetrievalRepresentation Learning	CodeCode Available	1	5
CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval	May 29, 2024	Cross-Modal RetrievalImage Retrieval	CodeCode Available	1	5
Deep Evidential Learning with Noisy Correspondence for Cross-Modal Retrieval	Oct 10, 2022	Cross-Modal RetrievalCross-modal retrieval with noisy correspondence	CodeCode Available	1	5
Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval	Jun 22, 2021	Cross-Modal RetrievalDiversity	CodeCode Available	1	5
Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval	Jan 1, 2025	Cross-Modal RetrievalRetrieval	CodeCode Available	1	5
Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective	Dec 8, 2023	Cross-Modal RetrievalData Augmentation	CodeCode Available	1	5
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language	Sep 12, 2022	Contrastive LearningCross-Modal Retrieval	CodeCode Available	1	5
Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios	Apr 23, 2025	Cross-Modal RetrievalRecommendation Systems	CodeCode Available	1	5
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning	Feb 21, 2024	Cross-Modal RetrievalImage Captioning	CodeCode Available	1	5
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification	Aug 4, 2022	Cross-Modal RetrievalPerson Re-Identification	CodeCode Available	1	5
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	Jul 16, 2021	Cross-Modal RetrievalGrounded language learning	CodeCode Available	1	5
A Survey on Interpretable Cross-modal Reasoning	Sep 5, 2023	Cross-Modal RetrievalDecision Making	CodeCode Available	1	5
Florence: A New Foundation Model for Computer Vision	Nov 22, 2021	Action ClassificationAction Recognition	CodeCode Available	1	5
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning	Mar 1, 2020	Cross-Modal RetrievalRetrieval	CodeCode Available	1	5
Cross-modal transformers for infrared and visible image fusion	Jun 26, 2023	Cross-Modal RetrievalDepth Estimation	CodeCode Available	1	5
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders	Aug 12, 2020	Cross-Modal Information RetrievalCross-Modal Retrieval	CodeCode Available	1	5
Fusion and Orthogonal Projection for Improved Face-Voice Association	Dec 20, 2021	Cross-Modal RetrievalTriplet	CodeCode Available	1	5
Image-text Retrieval via Preserving Main Semantics of Vision	Apr 20, 2023	Cross-Modal RetrievalImage-text Retrieval	CodeCode Available	1	5
Cross-Modal Retrieval with Partially Mismatched Pairs	Feb 22, 2023	Contrastive LearningCross-Modal Retrieval	CodeCode Available	1	5
Cross-modal Retrieval with Noisy Correspondence via Consistency Refining and Mining	Mar 25, 2024	Cross-Modal RetrievalCross-modal retrieval with noisy correspondence	CodeCode Available	1	5
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks	Mar 4, 2023	Cross-Modal RetrievalImage Captioning	CodeCode Available	1	5
Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions	Aug 28, 2023	Cross-Modal RetrievalRetrieval	CodeCode Available	1	5
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval	Oct 19, 2022	Cross-Modal RetrievalImage Retrieval	CodeCode Available	1	5

Show:10 25 50

← PrevPage 3 of 21Next →

All datasets COCO 2014 Flickr30k Recipe1M+RSICD RSITMD ChEBI-20 MSCOCO-1k SoundingEarth CUHK-PEDES Flickr-8k MSCOCO MS-COCO-2014

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	MaMMUT (ours)	Image-to-text R@1	70.7	—	Unverified
2	VAST	Text-to-image R@1	68	—	Unverified
3	X2-VLM (large)	Text-to-image R@1	67.7	—	Unverified
4	BEiT-3	Text-to-image R@1	67.2	—	Unverified
5	XFM (base)	Text-to-image R@1	67	—	Unverified
6	X2-VLM (base)	Text-to-image R@1	66.2	—	Unverified
7	PTP-BLIP (14M)	Text-to-image R@1	64.9	—	Unverified
8	OmniVL (14M)	Text-to-image R@1	64.8	—	Unverified
9	VSE-Gradient	Text-to-image R@1	63.6	—	Unverified
10	X-VLM (base)	Text-to-image R@1	63.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	X2-VLM (large)	Image-to-text R@1	98.8	—	Unverified
2	X2-VLM (base)	Image-to-text R@1	98.5	—	Unverified
3	BEiT-3	Image-to-text R@1	98	—	Unverified
4	OmniVL (14M)	Image-to-text R@1	97.3	—	Unverified
5	Aurora (ours, r=128)	Image-to-text R@1	97.2	—	Unverified
6	ERNIE-ViL 2.0	Image-to-text R@1	97.2	—	Unverified
7	X-VLM (base)	Image-to-text R@1	97.1	—	Unverified
8	VSE-Gradient	Image-to-text R@1	97	—	Unverified
9	ALIGN	Image-to-text R@1	95.3	—	Unverified
10	VAST	Text-to-image R@1	91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VLPCook (R1M+)	Image-to-text R@1	74.9	—	Unverified
2	VLPCook	Image-to-text R@1	73.6	—	Unverified
3	T-Food (CLIP)	Image-to-text R@1	72.3	—	Unverified
4	T-Food	Image-to-text R@1	68.2	—	Unverified
5	X-MRS	Image-to-text R@1	64	—	Unverified
6	H-T	Image-to-text R@1	60	—	Unverified
7	SCAN	Image-to-text R@1	54	—	Unverified
8	ACME	Image-to-text R@1	51.8	—	Unverified
9	VLPCook	Image-to-text R@1	45.2	—	Unverified
10	AdaMine	Image-to-text R@1	39.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HarMA (w/ GeoRSCLIP)	Mean Recall	38.95	—	Unverified
2	GeoRSCLIP-FT	Mean Recall	38.87	—	Unverified
3	GLISA	Mean Recall	37.69	—	Unverified
4	RemoteCLIP	Mean Recall	36.35	—	Unverified
5	PE-RSITR (MRS-Adapter)	Mean Recall	31.12	—	Unverified
6	PIR	Mean Recall	24.46	—	Unverified
7	DOVE	Mean Recall	22.72	—	Unverified
8	SWAN	Mean Recall	20.61	—	Unverified
9	GaLR	Mean Recall	18.96	—	Unverified
10	AMFMN	Mean Recall	15.53	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HarMA (w/ GeoRSCLIP)	Image-to-text R@1	32.74	—	Unverified
2	GeoRSCLIP-FT	Image-to-text R@1	32.3	—	Unverified
3	GLISA	Image-to-text R@1	32.08	—	Unverified
4	RemoteCLIP	Image-to-text R@1	28.76	—	Unverified
5	PE-RSITR (MRS-Adapter)	Image-to-text R@1	23.67	—	Unverified
6	PIR	Image-to-text R@1	18.14	—	Unverified
7	DOVE	Image-to-text R@1	16.81	—	Unverified
8	GaLR	Image-to-text R@1	14.82	—	Unverified
9	SWAN	Image-to-text R@1	13.35	—	Unverified
10	AMFMN	Image-to-text R@1	10.63	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	CLASS (ORMA)	Hits@1	67.4	—	Unverified
2	ORMA	Hits@1	66.5	—	Unverified
3	Song et al.	Hits@1	56.5	—	Unverified
4	CLASS (AMAN)	Hits@1	51.1	—	Unverified
5	DSOKR	Hits@1	51	—	Unverified
6	AMAN	Hits@1	49.4	—	Unverified
7	All-Ensemble	Hits@1	34.4	—	Unverified
8	MLP1	Hits@1	22.4	—	Unverified
9	GCN2	Hits@1	22.3	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Image-to-text R@1	81.9	—	Unverified
2	Dual-path CNN	Image-to-text R@1	41.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	ResNet-18	Median Rank	565	—	Unverified
2	GeoCLAP	Median Rank	159	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Dual Path	Text-to-image Medr	2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Image-to-text R@1	56.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	3SHNet	Image-to-text R@1	85.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Text-to-image R@1	43	—	Unverified