SOTAVerified

Knowledge Distillation Based on Transformed Teacher Matching

2024-02-17 · Code Available

Kaixiang Zheng, En-hui Yang


Abstract

As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher's and the student's logits in KD. Motivated by some recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance the student's capability to match the teacher's power-transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). Comprehensive experiments show that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy performance. Our source code is available at https://github.com/zkxufo/TTM.
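The core change described in the abstract can be sketched in a few lines: apply the temperature (equivalently, a power transform followed by renormalization) to the teacher's distribution only, and train the student's untempered distribution to match it via cross-entropy. The sketch below is a minimal NumPy illustration of this idea, not the authors' implementation; the function names and the hyperparameter value are assumptions, and WTTM's sample-adaptive weighting coefficient is not reproduced here.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T=1 gives the plain distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ttm_loss(teacher_logits, student_logits, T=4.0):
    """TTM sketch: temperature (a power transform of the distribution)
    is applied to the teacher only; the student side is left unscaled.
    Returns the mean cross-entropy between the transformed teacher
    distribution and the student's raw distribution.
    WTTM would additionally multiply each sample's loss by an adaptive
    weight (omitted here, as the exact coefficient is paper-specific)."""
    q = softmax(teacher_logits, T)        # power-transformed teacher
    p = softmax(student_logits)           # untempered student
    return -(q * np.log(p + 1e-12)).sum(axis=-1).mean()
```

Minimizing this cross-entropy over the student decomposes into a KL term plus the (Rényi-related) entropy of the transformed teacher; the latter is constant in the student's parameters, which is why the regularization effect discussed in the abstract shows up in the student's matching target rather than as an explicit penalty.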

Tasks

Benchmark Results

| Dataset  | Model                                  | Metric           | Claimed | Verified | Status     |
|----------|----------------------------------------|------------------|---------|----------|------------|
| ImageNet | WTTM (T: DeiT III-Small, S: DeiT-Tiny) | Top-1 accuracy % | 77.03   | —        | Unverified |
| ImageNet | WTTM (T: ResNet-50, S: MobileNet-V1)   | Top-1 accuracy % | 73.09   | —        | Unverified |
| ImageNet | WTTM (T: ResNet-34, S: ResNet-18)      | Top-1 accuracy % | 72.19   | —        | Unverified |

Reproductions