SOTAVerified

Improving Knowledge Distillation via Regularizing Feature Norm and Direction

2023-05-26 · Code Available

Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong


Abstract

Knowledge distillation (KD) exploits a large, well-trained model (the teacher) to train a small student model on the same dataset for the same task. Treating teacher features as knowledge, prevailing KD methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL divergence between their logits or the L2 distance between their intermediate features. While it is natural to believe that closer alignment of student features to the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. In this work, we propose to align student features with the class-means of teacher features, where a class-mean naturally serves as a strong classifier. To this end, we explore baseline techniques such as a cosine-distance loss that encourages similarity between student features and their corresponding teacher class-means. Moreover, we train the student to produce large-norm features, inspired by other lines of work (e.g., model pruning and domain adaptation) that find large-norm features to be more significant. Finally, we propose a rather simple loss term (dubbed ND loss) that simultaneously (1) encourages the student to produce large-norm features, and (2) aligns the direction of student features with teacher class-means. Experiments on standard benchmarks demonstrate that the explored techniques help existing KD methods achieve better performance, i.e., higher classification accuracy on the ImageNet and CIFAR-100 datasets, and higher detection precision on the COCO dataset. Importantly, our proposed ND loss helps the most, leading to state-of-the-art performance on these benchmarks. The source code is available at https://github.com/WangYZ1608/Knowledge-Distillation-via-ND.
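The two goals the abstract names (large feature norms, directions aligned with teacher class-means) can both be served by a single term: the negative projection of the student feature onto the unit-normalized teacher class-mean. The sketch below illustrates this idea only; the function name `nd_loss` and its exact form are illustrative assumptions, not the paper's reference implementation (see the linked repository for that).

```python
import numpy as np

def nd_loss(student_feats, teacher_class_means, labels):
    """Illustrative ND-style loss (assumed form, not the official one).

    student_feats:       (B, D) student feature vectors.
    teacher_class_means: (C, D) per-class mean of teacher features.
    labels:              (B,)   ground-truth class index per sample.
    """
    # Teacher class-mean direction for each sample's ground-truth class.
    mu = teacher_class_means[labels]                                  # (B, D)
    mu_hat = mu / (np.linalg.norm(mu, axis=1, keepdims=True) + 1e-8)  # unit vectors
    # Projection of the student feature onto that direction. Minimizing the
    # negative mean projection both grows the student feature's norm and
    # rotates it toward the teacher class-mean direction.
    proj = np.sum(student_feats * mu_hat, axis=1)                     # (B,)
    return float(-np.mean(proj))
```

For example, with class-means along the coordinate axes, features that point along their class-mean and have larger norm yield a lower (better) loss than misaligned features of the same norm.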

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CIFAR-100 | ReviewKD++ (T: resnet-32x4, S: shufflenet-v2) | Top-1 Accuracy (%) | 77.93 | — | Unverified |
| CIFAR-100 | ReviewKD++ (T: resnet-32x4, S: shufflenet-v1) | Top-1 Accuracy (%) | 77.68 | — | Unverified |
| CIFAR-100 | DKD++ (T: resnet-32x4, S: resnet-8x4) | Top-1 Accuracy (%) | 76.28 | — | Unverified |
| CIFAR-100 | ReviewKD++ (T: WRN-40-2, S: WRN-40-1) | Top-1 Accuracy (%) | 75.66 | — | Unverified |
| CIFAR-100 | KD++ (T: resnet56, S: resnet20) | Top-1 Accuracy (%) | 72.53 | — | Unverified |
| CIFAR-100 | DKD++ (T: resnet50, S: mobilenet-v2) | Top-1 Accuracy (%) | 70.82 | — | Unverified |
| COCO 2017 val | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet50)) | AP@0.5 | 61.8 | — | Unverified |
| COCO 2017 val | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet18)) | AP@0.5 | 57.96 | — | Unverified |
| COCO 2017 val | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (mobilenet-v2)) | AP@0.5 | 55.18 | — | Unverified |
| ImageNet | KD++ (T: regnety-16GF, S: ViT-B) | Top-1 Accuracy (%) | 83.6 | — | Unverified |
| ImageNet | KD++ (T: resnet-152, S: resnet-101) | Top-1 Accuracy (%) | 79.15 | — | Unverified |
| ImageNet | KD++ (T: resnet-152, S: resnet-50) | Top-1 Accuracy (%) | 77.48 | — | Unverified |
| ImageNet | KD++ (T: resnet-152, S: resnet-34) | Top-1 Accuracy (%) | 75.53 | — | Unverified |
| ImageNet | ReviewKD++ (T: resnet-50, S: mobilenet-v1) | Top-1 Accuracy (%) | 72.96 | — | Unverified |
| ImageNet | KD++ (T: resnet-152, S: resnet-18) | Top-1 Accuracy (%) | 72.54 | — | Unverified |
| ImageNet | KD++ (T: resnet-101, S: resnet-18) | Top-1 Accuracy (%) | 72.54 | — | Unverified |
| ImageNet | KD++ (T: resnet-50, S: resnet-18) | Top-1 Accuracy (%) | 72.53 | — | Unverified |
| ImageNet | KD++ (T: resnet-34, S: resnet-18) | Top-1 Accuracy (%) | 72.07 | — | Unverified |
| ImageNet | KD++ (T: ViT-B, S: resnet-18) | Top-1 Accuracy (%) | 71.84 | — | Unverified |
| ImageNet | KD++ (T: ViT-S, S: resnet-18) | Top-1 Accuracy (%) | 71.46 | — | Unverified |

Reproductions