A Simple Long-Tailed Recognition Baseline via Vision-Language Model

2021-11-29

Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

Code

github.com/gaopengcuhk/ballad
Official, in paper, PyTorch, ★ 59

Abstract

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either apply class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, which limits their supervision information and restricts their transferability to novel instances. Recent advances in large-scale contrastive vision-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervision, pretrained contrastive vision-language models learn powerful multimodal representations that are promising for handling data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets new state-of-the-art performance and outperforms competitive baselines by a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.
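The pieces named in the abstract (vision-language matching by semantic similarity, an adapter on top of frozen features, and balanced re-sampling) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released BALLAD code: the function names, the temperature of 0.01, the residual mixing factor `alpha`, and the choice of class-balanced sampling as the re-sampling strategy are all hypothetical, and the embeddings are assumed to come from already-trained image and text encoders.

```python
import math
import random
from collections import defaultdict


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def match_probs(image_feat, text_feats, temperature=0.01):
    """Softmax over cosine similarities between one image and each class text embedding."""
    img = l2_normalize(image_feat)
    logits = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) / temperature
        for t in text_feats
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt(feat, weight, alpha=0.2):
    """Hypothetical linear adapter with a residual mix: alpha * (W f) + (1 - alpha) * f."""
    projected = [sum(w * x for w, x in zip(row, feat)) for row in weight]
    return [alpha * p + (1 - alpha) * x for p, x in zip(projected, feat)]


def class_balanced_indices(labels, num_samples, rng):
    """Assumed re-sampling rule: pick a class uniformly, then a random example of it."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(num_samples)]
```

In this sketch, the second stage would draw training indices with `class_balanced_indices`, pass the frozen image features through `adapt`, and score them against the class text embeddings with `match_probs`, so that head and tail classes are seen about equally often while only the adapter weights change.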

Tasks

Contrastive Learning, Language Modeling, Language Modelling, Long-tail Learning, Semantic Similarity, Semantic Textual Similarity

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
CIFAR-100-LT (ρ=100)	BALLAD (ViT-B/16)	Error Rate	22.2	—	Unverified
ImageNet-LT	BALLAD (ResNet-50×16)	Top-1 Accuracy	76.5	—	Unverified
ImageNet-LT	BALLAD (ViT-B/16)	Top-1 Accuracy	75.7	—	Unverified
ImageNet-LT	BALLAD (ResNet-101)	Top-1 Accuracy	70.5	—	Unverified
ImageNet-LT	BALLAD (ResNet-50)	Top-1 Accuracy	67.2	—	Unverified
Places-LT	BALLAD (ViT-B/16)	Top-1 Accuracy	49.5	—	Unverified
Places-LT	BALLAD (ResNet-50×16)	Top-1 Accuracy	49.3	—	Unverified
Places-LT	BALLAD (ResNet-101)	Top-1 Accuracy	47.9	—	Unverified
Places-LT	BALLAD (ResNet-50)	Top-1 Accuracy	46.5	—	Unverified
