SOTAVerified

Zero-Shot Cross-Modal Retrieval

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without any training examples for the retrieval task at hand. For example, given an image, find a matching text, or vice versa. The task poses a distinctive challenge known as the heterogeneity gap: items from different modalities (such as text and images) are represented by inherently different data types, so their similarity cannot easily be measured directly. Most current approaches bridge the heterogeneity gap by learning a shared latent representation space. Data from each modality are projected into this common space, where similarity between items can be measured directly, regardless of modality.

Source: Extending CLIP for Category-to-image Retrieval in E-commerce
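
The shared-space idea described above can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by some pair of modality-specific encoders; the arrays below are synthetic stand-ins rather than outputs of any model listed on this page, and the function names `l2_normalize` and `retrieve` are hypothetical helpers chosen for illustration. Cosine similarity is used here as one common choice of similarity measure, not the only one.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(query_emb, gallery_emb, k=3):
    """Rank gallery items for each query by cosine similarity.

    Both inputs are assumed to live in the same shared embedding space.
    Returns (indices, scores), each of shape (num_queries, k).
    """
    q = l2_normalize(np.atleast_2d(np.asarray(query_emb, dtype=float)))
    g = l2_normalize(np.asarray(gallery_emb, dtype=float))
    sims = q @ g.T                      # pairwise cosine similarities
    k = min(k, g.shape[0])
    top = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    return top, np.take_along_axis(sims, top, axis=1)


# Synthetic stand-ins for encoder outputs (assumption: 5 paired items, 16 dims).
rng = np.random.default_rng(0)
shared = rng.normal(size=(5, 16))
image_emb = shared + 0.05 * rng.normal(size=shared.shape)  # "image encoder" output
text_emb = shared + 0.05 * rng.normal(size=shared.shape)   # "text encoder" output

# Text-to-image retrieval: each text query should rank its paired image first.
idx, scores = retrieve(text_emb, image_emb, k=1)
```

Swapping the arguments, as in `retrieve(image_emb, text_emb)`, gives image-to-text retrieval; once both modalities share a space, the ranking step itself does not depend on which modality the query comes from.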

Papers

Showing 1-25 of 26 papers

Title | Status | Hype
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Code | 4
Flamingo: a Visual Language Model for Few-Shot Learning | Code | 4
Merlin: A Vision Language Foundation Model for 3D Computed Tomography | Code | 3
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
Vision-Language Pre-Training with Triple Contrastive Learning | Code | 2
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Code | 2
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Code | 1
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | Code | 1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | Code | 1
Position-guided Text Prompt for Vision-Language Pre-training | Code | 1
Reproducible scaling laws for contrastive language-image learning | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Florence: A New Foundation Model for Computer Vision | Code | 1
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs | - | 0
A Recipe for Improving Remote Sensing VLM Zero Shot Generalization | - | 0
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | Code | 0
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Code | 0
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | Code | 0
Information-Theoretic Hashing for Zero-Shot Cross-Modal Retrieval | - | 0
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | Code | 0

No leaderboard results yet.