SOTAVerified

Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is the task of retrieving relevant items across different modalities, such as image, text, video, and audio. Its core challenge is the heterogeneity gap: data from different modalities have distinct representations, which makes direct comparison difficult. To address this, most CMR methods learn a shared latent embedding space into which all modalities are projected, so that cross-modal similarity can be measured with a standard distance metric.
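
In practice, the shared space is typically realized with a dual encoder. Below is a minimal text-to-image retrieval sketch using a pretrained CLIP model from the Hugging Face transformers library; the checkpoint name is a real public one, but the image file names are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained dual encoder: one tower per modality, projecting into a shared space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a cat", "a photo of a dog"]
images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]  # placeholder files

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Project each modality into the shared embedding space and L2-normalize,
# so that dot products are cosine similarities.
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)

# similarity[i, j]: score of text query i against image j.
similarity = text_embeds @ image_embeds.T
best_image_per_text = similarity.argmax(dim=-1)
```

Because both embeddings are unit-normalized, the matrix product directly yields cosine similarities; ranking images by this score per text query is the retrieval step.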

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Papers

Showing 1–10 of 522 papers

Title | Status | Hype
An analysis of vision-language models for fabric retrieval | – | 0
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | – | 0
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | – | 0
Multimodal Medical Image Binding via Shared Text Embeddings | – | 0
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models | – | 0
ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025 | Code | 0
SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking | – | 0
FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution | – | 0
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | – | 0
DocMMIR: A Framework for Document Multi-modal Information Retrieval | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | NAPReg | Text-to-image R@1 | 43 | – | Unverified
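
The "Text-to-image R@1" metric above is Recall@1: the percentage of text queries for which the ground-truth image is the top-ranked result. Here is a minimal sketch of the computation, assuming a precomputed text-by-image similarity matrix over paired evaluation data (query i matches image i); the helper name is illustrative:

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int = 1) -> float:
    """Text-to-image Recall@K, reported as a percentage.

    similarity[i, j] scores text query i against image j; the ground-truth
    match for query i is assumed to be image i (paired evaluation data).
    """
    num_queries = similarity.size(0)
    topk = similarity.topk(k, dim=-1).indices          # (num_queries, k)
    targets = torch.arange(num_queries).unsqueeze(-1)  # (num_queries, 1)
    hits = (topk == targets).any(dim=-1).float()       # 1.0 if ranked in top-k
    return 100.0 * hits.mean().item()
```

With the similarity matrix from the retrieval sketch earlier on this page, recall_at_k(similarity, k=1) would follow the standard R@1 evaluation protocol.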