SOTAVerified

Exploring Localization for Self-supervised Fine-grained Contrastive Learning

2021-06-30 · Code Available

Di Wu, Siyuan Li, Zelin Zang, Stan Z. Li


Abstract

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios is not fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and therefore struggle to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel view-generation strategy, and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
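The crop-and-swap view generation described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: it assumes each image comes with a saliency map, approximates the salient region by a fixed-size patch centered on the saliency peak, and uses negative cosine similarity as a stand-in for the cross-view alignment loss. All function names (`peak_patch`, `crop_and_swap`, `alignment_loss`) are hypothetical.

```python
import numpy as np

def peak_patch(sal, size):
    """Top-left corner of a size x size patch centered on the saliency peak.

    `sal` is a 2-D saliency map; the patch is clamped to the image bounds.
    """
    h, w = sal.shape
    y, x = np.unravel_index(np.argmax(sal), sal.shape)
    top = min(max(y - size // 2, 0), h - size)
    left = min(max(x - size // 2, 0), w - size)
    return top, left

def crop_and_swap(img_a, img_b, sal_a, sal_b, size=8):
    """Swap the most-salient size x size patches between two images.

    Illustrative stand-in for CVSA's saliency crop-and-swap view generation;
    the paper's actual region selection may differ.
    """
    ya, xa = peak_patch(sal_a, size)
    yb, xb = peak_patch(sal_b, size)
    out_a, out_b = img_a.copy(), img_b.copy()
    out_a[ya:ya + size, xa:xa + size] = img_b[yb:yb + size, xb:xb + size]
    out_b[yb:yb + size, xb:xb + size] = img_a[ya:ya + size, xa:xa + size]
    return out_a, out_b

def alignment_loss(feat_a, feat_b):
    """Cross-view alignment as 1 - cosine similarity of two feature vectors
    (a common choice; the paper's exact loss may differ)."""
    a = feat_a / (np.linalg.norm(feat_a) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b) + 1e-8)
    return 1.0 - float(a @ b)
```

In a training loop, the swapped views would be encoded by the backbone and the alignment loss applied to the features of the corresponding salient regions, encouraging the model to attend to the foreground object rather than background texture.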

Tasks

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CUB-200-2011 | BYOL+CVSA (ResNet-50) | Accuracy | 77.1 | | Unverified |
| FGVC-Aircraft | BYOL+CVSA (ResNet-50) | Accuracy | 87.27 | | Unverified |
| NABirds | BYOL+CVSA (ResNet-50) | Accuracy | 79.64 | | Unverified |
| Stanford Cars | BYOL+CVSA (ResNet-50) | Accuracy | 89.76 | | Unverified |

Reproductions