Distilling Knowledge by Mimicking Features

2020-11-03

Guo-Hua Wang, Yifan Ge, Jianxin Wu

Code

github.com/DoctorKey/LSHFM.singleclassification
Official, PyTorch, ★ 7
github.com/DoctorKey/LSHFM.multiclassification
Official, PyTorch, ★ 5
github.com/DoctorKey/LSHFM.detection
Official, PyTorch, ★ 5

Abstract

Knowledge distillation (KD) is a popular method for training efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it can achieve higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into its magnitude and its direction. We argue that the teacher should give more freedom to the magnitude of the student's feature and let the student pay more attention to mimicking the feature direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy. We provide theoretical analyses of how LSH facilitates feature direction mimicking, and further extend feature mimicking to multi-label recognition and object detection.
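
The abstract does not spell out the form of the LSH-based loss, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation (see the official repositories for that). It assumes sign-random-projection hashing with no bias term: the teacher's feature is hashed into bits by random hyperplanes, and the student's projections onto the same hyperplanes are scored against those bits with binary cross-entropy. All function names below are hypothetical and chosen for illustration.

```python
import math
import random


def split_magnitude_direction(feature):
    """Decompose a feature vector into its L2 norm and its unit direction."""
    magnitude = math.sqrt(sum(x * x for x in feature))
    direction = [x / magnitude for x in feature]
    return magnitude, direction


def random_hyperplanes(num_hashes, dim, seed=0):
    """Draw Gaussian hyperplane normals (assumption: no bias term is used)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]


def project(planes, feature):
    """Inner product of the feature with every hyperplane normal."""
    return [sum(w * x for w, x in zip(plane, feature)) for plane in planes]


def hash_bits(planes, feature):
    """Sign-random-projection hash: one bit per hyperplane."""
    return [1 if z > 0 else 0 for z in project(planes, feature)]


def lsh_mimic_loss(planes, teacher_feature, student_feature):
    """Mean binary cross-entropy between teacher hash bits and student logits."""
    targets = hash_bits(planes, teacher_feature)
    logits = project(planes, student_feature)
    total = 0.0
    for h, z in zip(targets, logits):
        # Numerically stable form of -[h*log(sigmoid(z)) + (1-h)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * h + math.log1p(math.exp(-abs(z)))
    return total / len(targets)
```

In this sketch the teacher's hash bits depend only on the sign of each projection, so rescaling the teacher feature by a positive constant leaves the targets unchanged; the targets therefore encode the feature's direction and not its magnitude. A student feature pointing in the teacher's direction incurs a lower loss than one pointing the opposite way.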

Tasks

Knowledge Distillation, Object Detection

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
COCO (Common Objects in Context)	LSHFM (T: ResNet101 S: ResNet50)	mAP	77.16	—	Unverified
COCO (Common Objects in Context)	LSHFM (T: ResNet101 S: MobileNetV2)	mAP	73.73	—	Unverified
ImageNet	LSHFM (T: ResNet-34 S: ResNet-18)	Top-1 accuracy %	71.72	—	Unverified
PASCAL VOC	LSHFM (T: ResNet101 S: ResNet50)	mAP	93.17	—	Unverified
PASCAL VOC	LSHFM (T: ResNet101 S: MobileNetV2)	mAP	90.14	—	Unverified
