SOTAVerified

Gradient Sparsification For Masked Fine-Tuning of Transformers

2021-11-16 · ACL ARR November 2021

Anonymous

Abstract

Fine-tuning masked language models is widely adopted for transfer learning to downstream tasks and can be achieved by either (1) freezing the gradients of the pretrained network, or equivalently, updating only the gradients of a newly added classification layer, or (2) performing gradient updates on all parameters. Gradual unfreezing trades off between the two by progressively unfreezing the gradients of whole layers during training. We propose to extend this to stochastic gradient masking in order to regularize pretrained language models for improved fine-tuning performance. We introduce GradDrop and variants thereof, a class of gradient sparsification methods that mask gradients prior to gradient descent. Unlike gradual unfreezing, which is non-sparse and deterministic, GradDrop is sparse and stochastic. Experiments on the multilingual XGLUE benchmark with XLM-R Large show that GradDrop outperforms standard fine-tuning and gradual unfreezing, while being competitive against methods that use additional translated data and intermediate pretraining. Lastly, we identify cases where the largest zero-shot performance gains occur on less-resourced languages.
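
The abstract does not spell out GradDrop's masking granularity or schedule, so the sketch below only illustrates the general idea it describes: sample a stochastic mask and zero out gradients before the optimizer update. The `mask_gradients` helper and the element-wise Bernoulli mask with a fixed `keep_prob` are assumptions for illustration, not the paper's exact method.

```python
# Minimal sketch of stochastic gradient masking in PyTorch.
# Assumption: element-wise Bernoulli masking with a fixed keep probability;
# the paper's actual GradDrop variants may mask at a different granularity
# or vary the rate over training.
import torch

def mask_gradients(model: torch.nn.Module, keep_prob: float = 0.9) -> None:
    """Zero each gradient entry independently with probability 1 - keep_prob."""
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                # Bernoulli(keep_prob) mask with the same shape as the gradient.
                mask = torch.bernoulli(torch.full_like(param.grad, keep_prob))
                param.grad.mul_(mask)

# Usage inside a standard fine-tuning loop (model, optimizer, loss assumed defined):
#   loss.backward()
#   mask_gradients(model, keep_prob=0.9)  # sparsify gradients before the update
#   optimizer.step()
#   optimizer.zero_grad()
```

Because the mask is resampled at every step, each update touches a different random subset of parameters, which is what distinguishes this from gradual unfreezing's deterministic, layer-wise schedule.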
