SOTAVerified

Weight Averaging Improves Knowledge Distillation under Domain Shift

2023-09-20 · Code Available

Valeriy Berezovskiy, Nikita Morozov


Abstract

Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It focuses on training a small student network to mimic a larger teacher network. While it is widely known that KD can improve student generalization in the i.i.d. setting, its performance under domain shift, i.e., the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper, we take a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in the domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simplistic weight averaging strategy that does not require evaluation on validation data during training and show that it performs on par with SWAD and SMA when applied to KD. We name our final distillation approach Weight-Averaged Knowledge Distillation (WAKD).
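The abstract combines a standard KD objective with weight averaging of the student during training. Below is a minimal PyTorch sketch of that idea, assuming a Hinton-style KD loss (temperature-scaled KL to teacher logits plus cross-entropy on labels) and a plain uniform running average of student weights after a warm-up epoch. The function names, hyperparameters, and averaging schedule here are illustrative assumptions, not the paper's exact WAKD recipe.

```python
# Sketch: knowledge distillation with a running weight average of the student.
# The KD loss and the uniform moving average are standard; the specific
# schedule (when and how often to average) is an assumption, not WAKD's exact one.
import copy
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: KL to temperature-softened teacher targets + CE to labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def distill_with_weight_averaging(student, teacher, loader, optimizer,
                                  epochs=30, avg_start_epoch=10, device="cpu"):
    """Train the student with KD and keep a uniform running average of its weights,
    without using any validation data to decide which checkpoints to average."""
    teacher.eval()
    averaged = None   # running average of student parameters
    n_models = 0      # number of checkpoints folded into the average
    for epoch in range(epochs):
        student.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = kd_loss(s_logits, t_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # After a warm-up period, fold the current weights into the average.
        if epoch >= avg_start_epoch:
            if averaged is None:
                averaged = copy.deepcopy(student).eval()
                n_models = 1
            else:
                n_models += 1
                with torch.no_grad():
                    for p_avg, p in zip(averaged.parameters(), student.parameters()):
                        # new_avg = (old_avg * (n - 1) + p) / n
                        p_avg.mul_(n_models - 1).add_(p).div_(n_models)
    return averaged if averaged is not None else student
```

If the student uses BatchNorm, the running statistics of the averaged model should be recomputed on training data before evaluation (as is done for SWA-style averaging); models without BatchNorm, such as DeiT-Ti, do not need this step.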

Tasks

Knowledge Distillation · Domain Generalization

Benchmark Results

Dataset     | Model            | Metric               | Claimed | Verified | Status
Office-Home | WAKD (DeiT-Ti)   | Average Accuracy (%) | 70.5    | -        | Unverified
Office-Home | WAKD (ResNet-18) | Average Accuracy (%) | 66.7    | -        | Unverified
PACS        | WAKD (DeiT-Ti)   | Average Accuracy (%) | 87.6    | -        | Unverified
PACS        | WAKD (ResNet-18) | Average Accuracy (%) | 86.6    | -        | Unverified

Reproductions