Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features

2022-06-19 · CVPRW 2022 · Code Available

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo


Code

github.com/abaldrati/clip4cirdemo
Official · pytorch · ★ 85
github.com/ABaldrati/CLIP4Cir
pytorch · ★ 194

Abstract

In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), the query image is combined with a text that conveys the user's intent, a setting relevant to application domains such as e-commerce. The proposed method starts with an initial training stage in which a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. In a second training stage, we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
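
The first training stage described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the abstract does not say what the "simple combination" is, so an element-wise sum followed by L2 normalization is assumed here, and the batch-based contrastive loss, the temperature of 0.07, the function names, and the toy 3-dimensional vectors (standing in for CLIP features) are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_sum(image_feat, text_feat):
    """Assumed 'simple combination': element-wise sum, then L2 normalization."""
    return l2_normalize([a + b for a, b in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.07):
    """Batch-based contrastive loss (assumed form).

    queries[i] should match targets[i]; every other target in the batch
    serves as a negative. The temperature value is an assumption.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Scaled dot-product similarity of this query to every target.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy stand-ins for CLIP features (hypothetical values, not real embeddings).
reference_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
relative_texts = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
target_images = [l2_normalize([1.0, 0.0, 1.0]), l2_normalize([0.0, 1.0, -1.0])]

queries = [combine_sum(i, t) for i, t in zip(reference_images, relative_texts)]
loss_matched = contrastive_loss(queries, target_images)
loss_swapped = contrastive_loss(queries, target_images[::-1])
print(loss_matched < loss_swapped)
```

In this toy batch, pairing each combined query with its own target gives a lower loss than pairing it with the other target, which is the signal such a loss would use when fine-tuning the text encoder. The second stage would replace `combine_sum` with a learned combiner network, whose architecture the abstract does not describe.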

Tasks

Composed Image Retrieval (CoIR), Content-Based Image Retrieval, Contrastive Learning, Image Retrieval, Retrieval
