Feature Structure Distillation for BERT Transferring
Anonymous
Abstract
Knowledge distillation is an approach that transfers information about feature representations from a teacher to a student by reducing the difference between them. A challenge of this approach is that it reduces the flexibility of the student's representations, which induces inaccurate learning of the teacher's knowledge. To resolve this in BERT transferring, we investigate the distillation of structures of representations, specified as three types: intra-feature, local inter-feature, and global inter-feature structures. To transfer them, we introduce feature structure distillation methods based on Centered Kernel Alignment, which assigns a consistent value to similar distributions of representations and reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. In experiments on the nine language understanding tasks of the GLUE dataset, the proposed methods effectively transfer the three types of structures and improve performance compared to state-of-the-art distillation methods.
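
To make the role of Centered Kernel Alignment (CKA) concrete, the following is a minimal sketch of the linear variant of CKA and of a loss built on it. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which kernel is used, and the function names `linear_cka` and `cka_distillation_loss`, the loss form `1 - CKA`, and the feature shapes are all hypothetical choices made here.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with the same number of rows.

    x has shape (n, d_x) and y has shape (n, d_y); each row is one example.
    The feature widths d_x and d_y may differ (e.g. teacher vs. student).
    """
    # Center every feature dimension over the n examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance (numerator),
    # normalized by the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_distillation_loss(teacher: np.ndarray, student: np.ndarray) -> float:
    """Hypothetical loss: small when the student's feature structure
    matches the teacher's. Assumed form, not taken from the paper."""
    return 1.0 - linear_cka(teacher, student)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(32, 16))  # assumed: 32 examples, 16-d features
    # A rotated and rescaled copy of the teacher features.
    q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    student = 3.0 * teacher @ q
    print(cka_distillation_loss(teacher, student))
```

In this sketch the rotated and rescaled copy scores a loss of approximately zero, because linear CKA is invariant to orthogonal transformations and isotropic scaling of the features. This is one way to read the claim that CKA assigns a consistent value to similar distributions of representations, whereas an element-wise distance between teacher and student features would penalize such a copy.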