
Zero-Shot Cross-Modal Retrieval

Zero-Shot Cross-Modal Retrieval is the task of finding relevant items across different modalities without having seen any training examples for the retrieval task: given an image, retrieve a relevant text, or vice versa. The task presents a challenge known as the heterogeneity gap, which arises because items from different modalities (such as text and images) have inherently different data types, so their similarity cannot be measured directly. To bridge this gap, most current approaches learn a shared latent representation space into which data from all modalities are projected; in this common space, similarity between items can be measured directly, regardless of modality.

Source: Extending CLIP for Category-to-image Retrieval in E-commerce
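The shared-space approach is what makes zero-shot retrieval possible with pretrained vision-language models such as CLIP. Below is a minimal sketch, assuming the Hugging Face transformers library; the checkpoint name, the candidate texts, and the "query.jpg" path are illustrative. It embeds an image and a set of candidate texts into CLIP's joint space and ranks the texts by cosine similarity, with no task-specific training:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (illustrative choice; any CLIP model works the same way).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical text corpus and query image path.
corpus = [
    "a dog running on the beach",
    "a bowl of fresh fruit on a table",
    "a city skyline at night",
]
image = Image.open("query.jpg")

with torch.no_grad():
    # Project both modalities into the shared embedding space.
    text_inputs = processor(text=corpus, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# L2-normalize so the dot product equals cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Rank corpus texts by similarity to the image.
scores = (image_emb @ text_emb.T).squeeze(0)
for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

Running the same pipeline with the modalities swapped (embedding a text query against an image corpus) gives image retrieval from text.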

Papers

Showing 21–26 of 26 papers

Title | Status | Hype
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Code | 2
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Code | 1
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | – | 0
UNITER: UNiversal Image-TExt Representation Learning | Code | 1

No leaderboard results yet.