Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

2023-10-23 · journal 2023 · Code Available

Zhong Chen, Zhizhong Zhang, Xin Tan, Yanyun Qu, and Yuan Xie


Abstract

Large-scale Vision-Language Pre-training (VLP) models, e.g., CLIP, have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions provide richer semantic supervision while not requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) that takes full advantage of the visual-text representation ability of CLIP. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that enable subsequent unsupervised training. These descriptions complement the rigid pseudo-labels and provide an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, where we first find correlated cross-modal prototypes with the Hungarian matching method and then swap the prototype pairs in the memory. Thus, standard contrastive learning can associate cross-modal information without any modification. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we achieve 54.0% Rank-1 accuracy, more than 9% above state-of-the-art approaches. Code is available at https://github.com/CzAngus/CCLNet.
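The memory-swapping step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes the visible and infrared memories are matrices of L2-normalized cluster prototypes, pairs them with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) under a negative-cosine-similarity cost, and swaps each matched pair between the two memories; the helper name `swap_matched_prototypes` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def swap_matched_prototypes(vis_memory: np.ndarray, ir_memory: np.ndarray):
    """Hungarian-match cross-modal prototypes, then swap matched pairs.

    vis_memory, ir_memory: (K, D) arrays of L2-normalized cluster
    prototypes for the visible and infrared modalities.
    (Hypothetical helper; a sketch of the idea, not the released code.)
    """
    # Cost matrix: negative cosine similarity between cross-modal prototypes,
    # so the Hungarian algorithm pairs the most similar clusters.
    cost = -vis_memory @ ir_memory.T
    rows, cols = linear_sum_assignment(cost)

    # Swap each matched pair: every memory slot now holds the prototype of
    # its cross-modal counterpart, so unchanged contrastive learning pulls
    # features toward the other modality's cluster.
    vis_swapped = vis_memory.copy()
    ir_swapped = ir_memory.copy()
    vis_swapped[rows] = ir_memory[cols]
    ir_swapped[cols] = vis_memory[rows]
    return vis_swapped, ir_swapped
```

After the swap, the usual prototype-contrastive loss can be computed against the swapped memories without any modification to the loss itself, which is the point of the memory-swapping design.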
