Deep-to-bottom Weights Decay: A Systemic Knowledge Review Learning Technique for Transformer Layers in Knowledge Distillation

2021-11-16 · ACL ARR November 2021

Anonymous

Abstract

Behind the outstanding performance of pre-trained language models on natural language processing tasks lie millions of parameters and enormous computational costs. Knowledge distillation is considered a compression strategy to address this problem. However, previous works either (i) distill only a subset of the teacher model's transformer layers, ignoring the importance of bottom-layer base information, or (ii) neglect the differing difficulty of knowledge from deep to shallow layers, which corresponds to different levels of information in the teacher model. We introduce a deep-to-bottom weights-decay review mechanism for knowledge distillation, which fuses teacher-side information while taking each layer's difficulty level into consideration. To validate our claims, we distill a 12-layer BERT into a 6-layer model and evaluate it on the GLUE benchmark. Experimental results show that our review approach outperforms existing techniques.
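The core idea described above can be sketched in a few lines: per-layer teacher-student losses are fused with weights that decay from the deepest layer toward the bottom, so easier (deeper, more task-specific) knowledge is emphasized while shallower base information still contributes. This is a minimal illustrative sketch, not the paper's exact formulation; the geometric decay factor `gamma` and the function names are assumptions.

```python
def review_weights(num_layers: int, gamma: float = 0.5) -> list[float]:
    """Assign the largest weight to the deepest layer, decaying toward the bottom.

    `gamma` in (0, 1] is an assumed geometric decay factor; the paper's actual
    weighting scheme may differ.
    """
    # Raw weights decay geometrically from the deepest layer (last index)
    # down to the bottom layer (index 0).
    raw = [gamma ** (num_layers - 1 - i) for i in range(num_layers)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def review_distillation_loss(per_layer_losses: list[float],
                             gamma: float = 0.5) -> float:
    """Fuse per-layer distillation losses (e.g. MSE between hidden states)
    with deep-to-bottom decayed weights."""
    weights = review_weights(len(per_layer_losses), gamma)
    return sum(w * loss for w, loss in zip(weights, per_layer_losses))
```

For a 6-layer student, `review_weights(6)` yields weights that roughly halve with each step from the top layer down, so the deepest layer's loss dominates but bottom-layer information is never discarded entirely.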
