Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images

2024-02-06

Jen Hong Tan

Abstract

Can a lightweight Vision Transformer (ViT) match or exceed the performance of Convolutional Neural Networks (CNNs) like ResNet on small datasets with small image resolutions? This report demonstrates that a pure ViT can indeed achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling. Our experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, qualifying them as 'lightweight' models. Unlike previous approaches, our method attains state-of-the-art performance among similar lightweight transformer-based architectures without significantly scaling up images from CIFAR-10 and CIFAR-100. This achievement underscores the efficiency of our model, not only in handling small datasets but also in effectively processing images close to their original scale.
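
The masked auto-encoder pre-training mentioned above hides a random subset of image patches and trains the model to reconstruct them. The following is a minimal sketch of the patch-masking step only, not the author's implementation. The abstract does not state the patch size or the mask ratio, so `patch_size=4` and `mask_ratio=0.75` are assumptions chosen for illustration (0.75 is the default in the original MAE paper), and the function names `num_patches` and `random_mask` are hypothetical.

```python
import random


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(n_patches: int, mask_ratio: float, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Only the visible patches would be fed to the encoder; the masked
    ones are the reconstruction targets for the decoder.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_keep = int(n_patches * (1 - mask_ratio))
    order = list(range(n_patches))
    rng.shuffle(order)
    visible = sorted(order[:n_keep])
    masked = sorted(order[n_keep:])
    return visible, masked


# CIFAR images are 32x32; the patch size here is an assumed value.
n = num_patches(image_size=32, patch_size=4)
visible, masked = random_mask(n, mask_ratio=0.75, seed=0)
print(n, len(visible), len(masked))
```

Under these assumed settings a 32x32 image yields 64 patches, of which 16 stay visible to the encoder, which is one reason masked pre-training can be inexpensive even for a small model.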

Tasks

Image Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
CIFAR-10	ViT (lightweight, MAE pre-trained)	Percentage correct	96.41	—	Unverified
CIFAR-100	ViT (lightweight, MAE pre-trained)	Percentage correct	78.27	—	Unverified
