
Towards Full Utilization on Mask Task for Distilling PLMs into NMT

2021-09-17 · ACL ARR September 2021

Anonymous


Abstract

Pre-trained language models (PLMs) perform well on many natural language processing tasks, so their application to neural machine translation (NMT) has attracted wide attention. Knowledge distillation (KD) is one of the mainstream approaches and can bring considerable gains to NMT models without extra computational cost. However, previous KD methods for NMT distill knowledge only at the hidden-state level and cannot make full use of the teacher model. To address this issue, we propose KD based on the mask task as a more effective method for NMT, comprising encoder input conversion, mask task distillation, and a gradient optimization mechanism. We evaluate our translation systems on English→German and Chinese→English tasks, and our methods clearly outperform baseline methods. Moreover, our framework achieves strong performance with different PLMs.
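The abstract only names the mask-task distillation idea; as a rough illustration of what such a loss could look like, the sketch below trains a student's masked-token predictions toward a teacher PLM's masked-LM distribution via KL divergence. All function names, shapes, and the shared-vocabulary assumption are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of mask-task distillation: the student's predictions
# at masked positions are pushed toward the teacher PLM's masked-LM
# distribution. Assumes student and teacher share a vocabulary; this is
# NOT the paper's implementation, only an illustration of the idea.
import torch
import torch.nn.functional as F


def mask_task_distillation_loss(student_logits, teacher_logits, mask_positions,
                                temperature=1.0):
    """KL(teacher || student) restricted to masked positions.

    student_logits: (batch, seq_len, vocab) from the student's MLM head.
    teacher_logits: (batch, seq_len, vocab) from the frozen teacher PLM.
    mask_positions: (batch, seq_len) boolean tensor, True where tokens were masked.
    """
    # Keep only masked positions; distillation happens on the mask task alone.
    s = student_logits[mask_positions] / temperature  # (num_masked, vocab)
    t = teacher_logits[mask_positions] / temperature  # (num_masked, vocab)

    # Soft-target KL divergence, scaled by T^2 as in standard distillation.
    return F.kl_div(
        F.log_softmax(s, dim=-1),
        F.softmax(t, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)


if __name__ == "__main__":
    # Minimal usage example with random tensors standing in for model outputs.
    batch, seq_len, vocab = 2, 8, 100
    student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, seq_len, vocab)
    mask_positions = torch.zeros(batch, seq_len, dtype=torch.bool)
    mask_positions[:, 3] = True  # pretend position 3 was masked in each sentence

    loss = mask_task_distillation_loss(student_logits, teacher_logits, mask_positions)
    loss.backward()
    print(loss.item())
```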
