Are Transformers More Robust Than CNNs?

2021-11-10 · NeurIPS 2021 · Code Available

Yutong Bai, Jieru Mei, Alan Yuille, Cihang Xie


Code

github.com/ytongbai/ViTs-vs-CNNs
Official · In paper · PyTorch · ★ 179

Abstract

Transformers have emerged as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Nonetheless, surprisingly, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and are trained with distinct training frameworks. In this paper, we aim to provide the first fair and in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers in defending against adversarial attacks, if they properly adopt Transformers' training recipes. Regarding generalization on out-of-distribution samples, we show that pre-training on (external) large-scale datasets is not a fundamental requirement for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest that this stronger generalization largely stems from the Transformer's self-attention-like architectures per se, rather than from other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs.

Tasks

Adversarial Robustness

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet	ResNet-50 (SGD, Cosine)	Accuracy	77.4	—	Unverified
ImageNet	ResNet-50 (SGD, Step)	Accuracy	76.9	—	Unverified
ImageNet	DeiT-S (AdamW, Cosine)	Accuracy	76.8	—	Unverified
ImageNet	ResNet-50 (AdamW, Cosine)	Accuracy	76.4	—	Unverified
ImageNet-A	ResNet-50 (AdamW, Cosine)	Accuracy	3.1	—	Unverified
ImageNet-A	DeiT-S (AdamW, Cosine)	Accuracy	12.2	—	Unverified
ImageNet-A	ResNet-50 (SGD, Cosine)	Accuracy	3.3	—	Unverified
ImageNet-A	ResNet-50 (SGD, Step)	Accuracy	3.2	—	Unverified
ImageNet-C	DeiT-S (AdamW, Cosine)	mean Corruption Error (mCE)	48	—	Unverified
ImageNet-C	ResNet-50 (SGD, Cosine)	mean Corruption Error (mCE)	56.9	—	Unverified
ImageNet-C	ResNet-50 (SGD, Step)	mean Corruption Error (mCE)	57.9	—	Unverified
ImageNet-C	ResNet-50 (AdamW, Cosine)	mean Corruption Error (mCE)	59.3	—	Unverified
Stylized ImageNet	DeiT-S (AdamW, Cosine)	Accuracy	13	—	Unverified
Stylized ImageNet	ResNet-50 (SGD, Cosine)	Accuracy	8.4	—	Unverified
Stylized ImageNet	ResNet-50 (SGD, Step)	Accuracy	8.3	—	Unverified
Stylized ImageNet	ResNet-50 (AdamW, Cosine)	Accuracy	8.1	—	Unverified
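
The table mixes metrics where higher is better (Accuracy) with one where lower is better (mCE), so the gap between DeiT-S and the ResNet-50 variants is easy to misread. The sketch below copies the claimed values from the table and computes, per dataset, the margin between DeiT-S and the best ResNet-50 row. The names `ROWS`, `HIGHER_IS_BETTER`, and `summarize` are hypothetical helpers for illustration and are not part of the paper's code. The inputs are the claimed, unverified numbers.

```python
# Rows copied from the benchmark table: (dataset, model, metric, claimed value).
ROWS = [
    ("ImageNet", "ResNet-50 (SGD, Cosine)", "Accuracy", 77.4),
    ("ImageNet", "ResNet-50 (SGD, Step)", "Accuracy", 76.9),
    ("ImageNet", "DeiT-S (AdamW, Cosine)", "Accuracy", 76.8),
    ("ImageNet", "ResNet-50 (AdamW, Cosine)", "Accuracy", 76.4),
    ("ImageNet-A", "ResNet-50 (AdamW, Cosine)", "Accuracy", 3.1),
    ("ImageNet-A", "DeiT-S (AdamW, Cosine)", "Accuracy", 12.2),
    ("ImageNet-A", "ResNet-50 (SGD, Cosine)", "Accuracy", 3.3),
    ("ImageNet-A", "ResNet-50 (SGD, Step)", "Accuracy", 3.2),
    ("ImageNet-C", "DeiT-S (AdamW, Cosine)", "mean Corruption Error (mCE)", 48.0),
    ("ImageNet-C", "ResNet-50 (SGD, Cosine)", "mean Corruption Error (mCE)", 56.9),
    ("ImageNet-C", "ResNet-50 (SGD, Step)", "mean Corruption Error (mCE)", 57.9),
    ("ImageNet-C", "ResNet-50 (AdamW, Cosine)", "mean Corruption Error (mCE)", 59.3),
    ("Stylized ImageNet", "DeiT-S (AdamW, Cosine)", "Accuracy", 13.0),
    ("Stylized ImageNet", "ResNet-50 (SGD, Cosine)", "Accuracy", 8.4),
    ("Stylized ImageNet", "ResNet-50 (SGD, Step)", "Accuracy", 8.3),
    ("Stylized ImageNet", "ResNet-50 (AdamW, Cosine)", "Accuracy", 8.1),
]

# Direction of each metric: accuracy should go up, corruption error should go down.
HIGHER_IS_BETTER = {"Accuracy": True, "mean Corruption Error (mCE)": False}


def summarize(rows):
    """Per dataset, return (DeiT-S value, best ResNet-50 model, its value, margin).

    The margin is oriented so that a positive number means DeiT-S is better,
    regardless of whether the metric is accuracy or mCE.
    """
    by_dataset = {}
    for dataset, model, metric, value in rows:
        by_dataset.setdefault(dataset, []).append((model, metric, value))

    summary = {}
    for dataset, entries in by_dataset.items():
        metric = entries[0][1]  # each dataset uses a single metric in the table
        higher = HIGHER_IS_BETTER[metric]
        deit = next(v for m, _, v in entries if m.startswith("DeiT-S"))
        resnets = [(m, v) for m, _, v in entries if m.startswith("ResNet-50")]
        pick = max if higher else min
        best_model, best_value = pick(resnets, key=lambda mv: mv[1])
        margin = deit - best_value if higher else best_value - deit
        summary[dataset] = (deit, best_model, best_value, round(margin, 1))
    return summary


if __name__ == "__main__":
    for dataset, (deit, best_model, best_value, margin) in summarize(ROWS).items():
        print(f"{dataset}: DeiT-S {deit} vs {best_model} {best_value} (margin {margin:+.1f})")
```

On these claimed numbers, DeiT-S trails the best ResNet-50 by 0.6 points on clean ImageNet but leads on all three out-of-distribution sets: by 8.9 points on ImageNet-A, 8.9 mCE on ImageNet-C, and 4.6 points on Stylized ImageNet.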
