SOTAVerified

Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is the task of retrieving items in one modality using a query from another, across modalities such as image, text, video, and audio. The core challenge of CMR is the heterogeneity gap: data from different modalities have distinct representations, which makes direct comparison difficult. To address this, most CMR methods learn a shared latent embedding space. Items from every modality are projected into this space, so their similarity can be measured with a distance metric.
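As a rough illustration of the shared-space idea, the sketch below ranks gallery items by cosine similarity to each query and computes Recall@1, the metric used in most of the benchmark tables on this page. It is a minimal sketch under stated assumptions, not any listed model's method: the embeddings are hand-written toy vectors standing in for encoder outputs, the function names `l2_normalize`, `retrieve`, and `recall_at_k` are hypothetical, and query `i` is assumed to match gallery item `i`.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieve(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int = 1) -> np.ndarray:
    """Return, for each query, the indices of the k most similar gallery items."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    # Negate so argsort orders from most to least similar.
    return np.argsort(-sims, axis=1)[:, :k]


def recall_at_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose true match (assumed: same index) is in the top k."""
    topk = retrieve(query_emb, gallery_emb, k)
    truth = np.arange(len(query_emb))[:, None]
    return float((topk == truth).any(axis=1).mean())


# Toy embeddings (assumption): text and image vectors already projected
# into the same 3-dimensional space by some pair of encoders.
text_emb = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]])
image_emb = np.array([[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.7]])

top1 = retrieve(text_emb, image_emb, k=1)   # text-to-image retrieval
r_at_1 = recall_at_k(text_emb, image_emb, k=1)
```

Swapping the two arguments gives the image-to-text direction; the two directions are reported as separate metrics because their scores generally differ.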

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Papers

Showing 1-50 of 522 papers

| Title | Status | Hype |
|---|---|---|
| An analysis of vision-language models for fabric retrieval | | 0 |
| Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | | 0 |
| Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | | 0 |
| Multimodal Medical Image Binding via Shared Text Embeddings | | 0 |
| FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models | | 0 |
| ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025 | Code | 0 |
| SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking | | 0 |
| FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution | | 0 |
| EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | | 0 |
| DocMMIR: A Framework for Document Multi-modal Information Retrieval | Code | 0 |
| Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping | | 0 |
| SMOTExT: SMOTE meets Large Language Models | Code | 0 |
| GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval | | 0 |
| CellCLIP -- Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning | | 0 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2 |
| Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | | 0 |
| MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment | Code | 1 |
| OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval | | 0 |
| Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | Code | 0 |
| Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios | Code | 1 |
| Improving Sound Source Localization with Joint Slot Attention on Image and Audio | | 0 |
| The 1st EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval | | 0 |
| SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | | 0 |
| PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage | | 0 |
| Learning Sparse Disentangled Representations for Multimodal Exclusion Retrieval | | 0 |
| FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs | | 0 |
| LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Code | 1 |
| Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes | | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Code | 0 |
| Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology | Code | 2 |
| NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval | Code | 0 |
| Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation | | 0 |
| Astrea: A MOE-based Visual Understanding Model with Progressive Alignment | | 0 |
| A Recipe for Improving Remote Sensing VLM Zero Shot Generalization | | 0 |
| X2CT-CLIP: Enable Multi-Abnormality Detection in Computed Tomography from Chest Radiography via Tri-Modal Contrastive Learning | | 0 |
| Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval | Code | 1 |
| Composed Multi-modal Retrieval: A Survey of Approaches and Applications | Code | 2 |
| Lightweight Contrastive Distilled Hashing for Online Cross-modal Retrieval | | 0 |
| ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | Code | 1 |
| Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions | | 0 |
| On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | | 0 |
| CLASS: Enhancing Cross-Modal Text-Molecule Retrieval Performance and Training Efficiency | | 0 |
| GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis | Code | 1 |
| Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations | | 0 |
| TSVC:Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval | | 0 |
| Deep Reversible Consistency Learning for Cross-modal Retrieval | Code | 0 |
| Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels | Code | 1 |
| Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes | | 0 |
| Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models | | 0 |
| Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval | Code | 1 |

Benchmark Results

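| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MaMMUT (ours) | Image-to-text R@1 | 70.7 | | Unverified |
| 2 | VAST | Text-to-image R@1 | 68 | | Unverified |
| 3 | X2-VLM (large) | Text-to-image R@1 | 67.7 | | Unverified |
| 4 | BEiT-3 | Text-to-image R@1 | 67.2 | | Unverified |
| 5 | XFM (base) | Text-to-image R@1 | 67 | | Unverified |
| 6 | X2-VLM (base) | Text-to-image R@1 | 66.2 | | Unverified |
| 7 | PTP-BLIP (14M) | Text-to-image R@1 | 64.9 | | Unverified |
| 8 | OmniVL (14M) | Text-to-image R@1 | 64.8 | | Unverified |
| 9 | VSE-Gradient | Text-to-image R@1 | 63.6 | | Unverified |
| 10 | X-VLM (base) | Text-to-image R@1 | 63.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | X2-VLM (large) | Image-to-text R@1 | 98.8 | | Unverified |
| 2 | X2-VLM (base) | Image-to-text R@1 | 98.5 | | Unverified |
| 3 | BEiT-3 | Image-to-text R@1 | 98 | | Unverified |
| 4 | OmniVL (14M) | Image-to-text R@1 | 97.3 | | Unverified |
| 5 | ERNIE-ViL 2.0 | Image-to-text R@1 | 97.2 | | Unverified |
| 6 | Aurora (ours, r=128) | Image-to-text R@1 | 97.2 | | Unverified |
| 7 | X-VLM (base) | Image-to-text R@1 | 97.1 | | Unverified |
| 8 | VSE-Gradient | Image-to-text R@1 | 97 | | Unverified |
| 9 | ALIGN | Image-to-text R@1 | 95.3 | | Unverified |
| 10 | VAST | Text-to-image R@1 | 91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VLPCook (R1M+) | Image-to-text R@1 | 74.9 | | Unverified |
| 2 | VLPCook | Image-to-text R@1 | 73.6 | | Unverified |
| 3 | T-Food (CLIP) | Image-to-text R@1 | 72.3 | | Unverified |
| 4 | T-Food | Image-to-text R@1 | 68.2 | | Unverified |
| 5 | X-MRS | Image-to-text R@1 | 64 | | Unverified |
| 6 | H-T | Image-to-text R@1 | 60 | | Unverified |
| 7 | SCAN | Image-to-text R@1 | 54 | | Unverified |
| 8 | ACME | Image-to-text R@1 | 51.8 | | Unverified |
| 9 | VLPCook | Image-to-text R@1 | 45.2 | | Unverified |
| 10 | AdaMine | Image-to-text R@1 | 39.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HarMA (w/ GeoRSCLIP) | Mean Recall | 38.95 | | Unverified |
| 2 | GeoRSCLIP-FT | Mean Recall | 38.87 | | Unverified |
| 3 | GLISA | Mean Recall | 37.69 | | Unverified |
| 4 | RemoteCLIP | Mean Recall | 36.35 | | Unverified |
| 5 | PE-RSITR (MRS-Adapter) | Mean Recall | 31.12 | | Unverified |
| 6 | PIR | Mean Recall | 24.46 | | Unverified |
| 7 | DOVE | Mean Recall | 22.72 | | Unverified |
| 8 | SWAN | Mean Recall | 20.61 | | Unverified |
| 9 | GaLR | Mean Recall | 18.96 | | Unverified |
| 10 | AMFMN | Mean Recall | 15.53 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HarMA (w/ GeoRSCLIP) | Image-to-text R@1 | 32.74 | | Unverified |
| 2 | GeoRSCLIP-FT | Image-to-text R@1 | 32.3 | | Unverified |
| 3 | GLISA | Image-to-text R@1 | 32.08 | | Unverified |
| 4 | RemoteCLIP | Image-to-text R@1 | 28.76 | | Unverified |
| 5 | PE-RSITR (MRS-Adapter) | Image-to-text R@1 | 23.67 | | Unverified |
| 6 | PIR | Image-to-text R@1 | 18.14 | | Unverified |
| 7 | DOVE | Image-to-text R@1 | 16.81 | | Unverified |
| 8 | GaLR | Image-to-text R@1 | 14.82 | | Unverified |
| 9 | SWAN | Image-to-text R@1 | 13.35 | | Unverified |
| 10 | AMFMN | Image-to-text R@1 | 10.63 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | CLASS (ORMA) | Hits@1 | 67.4 | | Unverified |
| 2 | ORMA | Hits@1 | 66.5 | | Unverified |
| 3 | Song et al. | Hits@1 | 56.5 | | Unverified |
| 4 | CLASS (AMAN) | Hits@1 | 51.1 | | Unverified |
| 5 | DSOKR | Hits@1 | 51 | | Unverified |
| 6 | AMAN | Hits@1 | 49.4 | | Unverified |
| 7 | All-Ensemble | Hits@1 | 34.4 | | Unverified |
| 8 | MLP1 | Hits@1 | 22.4 | | Unverified |
| 9 | GCN2 | Hits@1 | 22.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Image-to-text R@1 | 81.9 | | Unverified |
| 2 | Dual-path CNN | Image-to-text R@1 | 41.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ResNet-18 | Median Rank | 565 | | Unverified |
| 2 | GeoCLAP | Median Rank | 159 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Dual Path | Text-to-image Medr | 2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Image-to-text R@1 | 56.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | 3SHNet | Image-to-text R@1 | 85.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NAPReg | Text-to-image R@1 | 43 | | Unverified |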