Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is a task of retrieving items across different modalities, such as image, text, video, and audio. The core challenge of CMR is the heterogeneity gap, which arises because data from different modalities have distinct representations, making direct comparison difficult. To address this, most CMR methods focus on learning a shared latent embedding space. In this space, concepts from different modalities are projected, allowing their similarity to be measured using a distance metric.

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 26–50 of 522 papers

Title	Date	Tasks	Status	Hype
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs	Apr 2, 2025	cross-modal alignmentCross-Modal Retrieval	—Unverified	0
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text	Mar 25, 2025	Cross-Modal RetrievalHallucination	CodeCode Available	1
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes	Mar 24, 2025	Cross-Modal RetrievalDisentanglement	—Unverified	0
PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval	Mar 20, 2025	Contrastive LearningCross-Modal Retrieval	CodeCode Available	0
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology	Mar 19, 2025	Cross-Modal RetrievalDiagnostic	CodeCode Available	2
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval	Mar 13, 2025	Cross-Modal RetrievalRetrieval	CodeCode Available	0
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation	Mar 13, 2025	Cross-Modal RetrievalTranslation	—Unverified	0
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Mar 12, 2025	Contrastive LearningCross-Modal Retrieval	—Unverified	0
A Recipe for Improving Remote Sensing VLM Zero Shot Generalization	Mar 10, 2025	Cross-Modal RetrievalZero-Shot Cross-Modal Retrieval	—Unverified	0
X2CT-CLIP: Enable Multi-Abnormality Detection in Computed Tomography from Chest Radiography via Tri-Modal Contrastive Learning	Mar 4, 2025	Anomaly DetectionComputed Tomography (CT)	—Unverified	0
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval	Mar 3, 2025	Cross-Modal RetrievalRetrieval	CodeCode Available	1
Composed Multi-modal Retrieval: A Survey of Approaches and Applications	Mar 3, 2025	Cross-Modal RetrievalData Augmentation	CodeCode Available	2
Lightweight Contrastive Distilled Hashing for Online Cross-modal Retrieval	Feb 27, 2025	Cross-Modal RetrievalKnowledge Distillation	—Unverified	0
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning	Feb 27, 2025	Cross-Modal RetrievalCross-modal retrieval with noisy correspondence	CodeCode Available	1
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions	Feb 26, 2025	Cross-Modal RetrievalLanguage Modeling	—Unverified	0
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation	Feb 26, 2025	Cross-Modal RetrievalHallucination	—Unverified	0
CLASS: Enhancing Cross-Modal Text-Molecule Retrieval Performance and Training Efficiency	Feb 17, 2025	Cross-Modal RetrievalRetrieval	—Unverified	0
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis	Feb 13, 2025	Cross-Modal RetrievalImage Captioning	CodeCode Available	1
Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations	Jan 26, 2025	Cross-Modal RetrievalImage Retrieval	—Unverified	0
TSVC:Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval	Jan 19, 2025	Cross-Modal RetrievalImage-text Retrieval	—Unverified	0
Deep Reversible Consistency Learning for Cross-modal Retrieval	Jan 10, 2025	Cross-Modal RetrievalRepresentation Learning	CodeCode Available	0
Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels	Jan 3, 2025	Computational EfficiencyCross-Modal Retrieval	CodeCode Available	1
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes	Jan 1, 2025	Cross-Modal RetrievalDisentanglement	—Unverified	0
Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models	Jan 1, 2025	Contrastive LearningCross-Modal Retrieval	—Unverified	0
Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval	Jan 1, 2025	Cross-Modal RetrievalRetrieval	CodeCode Available	1

Show:10 25 50

← PrevPage 2 of 21Next →

All datasets COCO 2014 Flickr30k Recipe1M+RSICD RSITMD ChEBI-20 MSCOCO-1k SoundingEarth CUHK-PEDES Flickr-8k MSCOCO MS-COCO-2014

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	MaMMUT (ours)	Image-to-text R@1	70.7	—	Unverified
2	VAST	Text-to-image R@1	68	—	Unverified
3	X2-VLM (large)	Text-to-image R@1	67.7	—	Unverified
4	BEiT-3	Text-to-image R@1	67.2	—	Unverified
5	XFM (base)	Text-to-image R@1	67	—	Unverified
6	X2-VLM (base)	Text-to-image R@1	66.2	—	Unverified
7	PTP-BLIP (14M)	Text-to-image R@1	64.9	—	Unverified
8	OmniVL (14M)	Text-to-image R@1	64.8	—	Unverified
9	VSE-Gradient	Text-to-image R@1	63.6	—	Unverified
10	X-VLM (base)	Text-to-image R@1	63.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	X2-VLM (large)	Image-to-text R@1	98.8	—	Unverified
2	X2-VLM (base)	Image-to-text R@1	98.5	—	Unverified
3	BEiT-3	Image-to-text R@1	98	—	Unverified
4	OmniVL (14M)	Image-to-text R@1	97.3	—	Unverified
5	Aurora (ours, r=128)	Image-to-text R@1	97.2	—	Unverified
6	ERNIE-ViL 2.0	Image-to-text R@1	97.2	—	Unverified
7	X-VLM (base)	Image-to-text R@1	97.1	—	Unverified
8	VSE-Gradient	Image-to-text R@1	97	—	Unverified
9	ALIGN	Image-to-text R@1	95.3	—	Unverified
10	VAST	Text-to-image R@1	91	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VLPCook (R1M+)	Image-to-text R@1	74.9	—	Unverified
2	VLPCook	Image-to-text R@1	73.6	—	Unverified
3	T-Food (CLIP)	Image-to-text R@1	72.3	—	Unverified
4	T-Food	Image-to-text R@1	68.2	—	Unverified
5	X-MRS	Image-to-text R@1	64	—	Unverified
6	H-T	Image-to-text R@1	60	—	Unverified
7	SCAN	Image-to-text R@1	54	—	Unverified
8	ACME	Image-to-text R@1	51.8	—	Unverified
9	VLPCook	Image-to-text R@1	45.2	—	Unverified
10	AdaMine	Image-to-text R@1	39.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HarMA (w/ GeoRSCLIP)	Mean Recall	38.95	—	Unverified
2	GeoRSCLIP-FT	Mean Recall	38.87	—	Unverified
3	GLISA	Mean Recall	37.69	—	Unverified
4	RemoteCLIP	Mean Recall	36.35	—	Unverified
5	PE-RSITR (MRS-Adapter)	Mean Recall	31.12	—	Unverified
6	PIR	Mean Recall	24.46	—	Unverified
7	DOVE	Mean Recall	22.72	—	Unverified
8	SWAN	Mean Recall	20.61	—	Unverified
9	GaLR	Mean Recall	18.96	—	Unverified
10	AMFMN	Mean Recall	15.53	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HarMA (w/ GeoRSCLIP)	Image-to-text R@1	32.74	—	Unverified
2	GeoRSCLIP-FT	Image-to-text R@1	32.3	—	Unverified
3	GLISA	Image-to-text R@1	32.08	—	Unverified
4	RemoteCLIP	Image-to-text R@1	28.76	—	Unverified
5	PE-RSITR (MRS-Adapter)	Image-to-text R@1	23.67	—	Unverified
6	PIR	Image-to-text R@1	18.14	—	Unverified
7	DOVE	Image-to-text R@1	16.81	—	Unverified
8	GaLR	Image-to-text R@1	14.82	—	Unverified
9	SWAN	Image-to-text R@1	13.35	—	Unverified
10	AMFMN	Image-to-text R@1	10.63	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	CLASS (ORMA)	Hits@1	67.4	—	Unverified
2	ORMA	Hits@1	66.5	—	Unverified
3	Song et al.	Hits@1	56.5	—	Unverified
4	CLASS (AMAN)	Hits@1	51.1	—	Unverified
5	DSOKR	Hits@1	51	—	Unverified
6	AMAN	Hits@1	49.4	—	Unverified
7	All-Ensemble	Hits@1	34.4	—	Unverified
8	MLP1	Hits@1	22.4	—	Unverified
9	GCN2	Hits@1	22.3	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Image-to-text R@1	81.9	—	Unverified
2	Dual-path CNN	Image-to-text R@1	41.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	ResNet-18	Median Rank	565	—	Unverified
2	GeoCLAP	Median Rank	159	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Dual Path	Text-to-image Medr	2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Image-to-text R@1	56.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	3SHNet	Image-to-text R@1	85.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	NAPReg	Text-to-image R@1	43	—	Unverified