SOTAVerified

Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is the task of retrieving items in one modality (such as image, text, video, or audio) given a query in another. The core challenge of CMR is the heterogeneity gap: data from different modalities have distinct representations, which makes direct comparison difficult. To address this, most CMR methods learn a shared latent embedding space. Items from different modalities are projected into this space, where their similarity can be measured with a distance metric.
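As a rough illustration of retrieval in a shared embedding space, the sketch below ranks text embeddings against image embeddings by cosine similarity and computes Recall@K (the R@1 metric in the benchmark results is the K=1 case). It is a minimal sketch under stated assumptions, not the method of any listed paper. The function names, the toy 2-D vectors, the choice of cosine similarity, and the convention that image i matches text i are all hypothetical. In practice the embeddings would come from trained modality-specific encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors in the shared space."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, candidates, k=1):
    """Percentage of queries whose ground-truth match appears in the top k.

    Assumption for this toy setup: query i is paired with candidate i.
    """
    hits = 0
    for i, query in enumerate(queries):
        if i in rank_candidates(query, candidates)[:k]:
            hits += 1
    return 100.0 * hits / len(queries)

# Hypothetical embeddings already projected into a shared 2-D space.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[0.9, 0.1], [0.3, 0.8], [0.2, 0.9]]

# Image-to-text retrieval: images are queries, texts are candidates.
r_at_1 = recall_at_k(image_embeddings, text_embeddings, k=1)
r_at_2 = recall_at_k(image_embeddings, text_embeddings, k=2)
print(f"Image-to-text R@1: {r_at_1:.1f}")
print(f"Image-to-text R@2: {r_at_2:.1f}")
```

With these toy vectors only the first image retrieves its paired text at rank 1, while all three pairs are recovered within the top 2, which shows how R@K rises as K grows.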

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Papers

Showing 1-10 of 522 papers

Title | Status | Hype
An analysis of vision-language models for fabric retrieval | - | 0
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | - | 0
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | - | 0
Multimodal Medical Image Binding via Shared Text Embeddings | - | 0
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models | - | 0
ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025 | Code | 0
SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking | - | 0
FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution | - | 0
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | - | 0
DocMMIR: A Framework for Document Multi-modal Information Retrieval | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VLPCook (R1M+) | Image-to-text R@1 | 74.9 | - | Unverified
2 | VLPCook | Image-to-text R@1 | 73.6 | - | Unverified
3 | T-Food (CLIP) | Image-to-text R@1 | 72.3 | - | Unverified
4 | T-Food | Image-to-text R@1 | 68.2 | - | Unverified
5 | X-MRS | Image-to-text R@1 | 64 | - | Unverified
6 | H-T | Image-to-text R@1 | 60 | - | Unverified
7 | SCAN | Image-to-text R@1 | 54 | - | Unverified
8 | ACME | Image-to-text R@1 | 51.8 | - | Unverified
9 | VLPCook | Image-to-text R@1 | 45.2 | - | Unverified
10 | AdaMine | Image-to-text R@1 | 39.8 | - | Unverified