
Zero-Shot Cross-Modal Retrieval

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without any task-specific training examples. For example, given an image, retrieve a matching text, or vice versa. The central challenge is the heterogeneity gap: items from different modalities (such as text and images) have inherently different data types, so their similarity cannot be measured directly. Most current approaches bridge this gap by learning a shared latent representation space. Data from each modality are projected into this common space, where similarity between items can be measured directly, regardless of modality.

Source: Extending CLIP for Category-to-image Retrieval in E-commerce
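The shared-space idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature vectors, the projection matrices `W_IMAGE` and `W_TEXT`, and the helper names `project` and `rank` are hypothetical, and the weights are hand-picked rather than learned. In a real system the projections would come from trained encoders.

```python
import math

def project(features, weights):
    """Map a feature vector into the shared space (features @ weights), then L2-normalize."""
    dim = len(weights[0])
    out = [sum(f * row[j] for f, row in zip(features, weights)) for j in range(dim)]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out]

def rank(query, candidates):
    """Return candidate indices sorted by cosine similarity to the query, best first."""
    # All vectors are unit length, so the dot product equals cosine similarity.
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])

# Hypothetical raw features: images are 4-dimensional, texts are 3-dimensional,
# so they cannot be compared directly (the heterogeneity gap).
image_features = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
text_features = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Hand-picked (not learned) projections into a shared 2-dimensional space.
W_IMAGE = [[1, 0], [0, 1], [1, 0], [0, 1]]
W_TEXT = [[1, 0], [0, 1], [1, 1]]

image_emb = [project(f, W_IMAGE) for f in image_features]
text_emb = [project(f, W_TEXT) for f in text_features]

# Image-to-text retrieval: rank all texts for image 0.
image_to_text = rank(image_emb[0], text_emb)
# Text-to-image retrieval: rank all images for text 1.
text_to_image = rank(text_emb[1], image_emb)

print(image_to_text)
print(text_to_image)
```

Once both modalities live in the same normalized space, a single `rank` function serves both retrieval directions, which is the practical benefit of the shared representation.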

Papers

Showing 26-26 of 26 papers

Title | Status | Hype
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | | 0

No leaderboard results yet.