Lessons on Parameter Sharing across Layers in Transformers

2021-04-13

Sho Takase, Shun Kiyono

Code

github.com/takase/share_layer_params
Official · In paper · PyTorch · ★ 28
github.com/jaketae/param-share-transformer
PyTorch · ★ 26

Abstract

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to improve efficiency in computational time. We propose three strategies for assigning parameters to each layer: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations that use a large amount of training data, such as the recent WMT competition.
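The abstract names the three strategies without spelling out the assignment rule, so the following is only a minimal sketch under one stated reading: with N layers and M distinct parameter sets, Sequence gives each set to N/M consecutive layers, Cycle repeats the M sets in order, and Cycle (rev) behaves like Cycle except that the last M layers use the sets in reverse order. The function name `assign_layers`, the strategy strings, and the requirement that M divides N are assumptions made for illustration; none of them is taken from the authors' repository.

```python
def assign_layers(num_layers: int, num_param_sets: int, strategy: str) -> list[int]:
    """Map each layer (0-based) to the index of the parameter set it reuses.

    Hypothetical helper: the rules below are an assumed reading of the
    Sequence / Cycle / Cycle (rev) strategies, not the authors' code.
    """
    n, m = num_layers, num_param_sets
    if m < 1 or n < m or n % m != 0:
        # Assumption: keep the sketch unambiguous by requiring M to divide N.
        raise ValueError("num_layers must be a positive multiple of num_param_sets")

    if strategy == "sequence":
        # Each parameter set is used by n // m consecutive layers.
        block = n // m
        return [i // block for i in range(n)]
    if strategy == "cycle":
        # The m parameter sets are stacked in order, then repeated.
        return [i % m for i in range(n)]
    if strategy == "cycle_rev":
        # Same as "cycle", except the final m layers run in reverse order.
        cutoff = n - m
        return [i % m if i < cutoff else m - 1 - (i % m) for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_layers(6, 3, name))
```

Under these assumptions, six layers sharing three parameter sets map to `[0, 0, 1, 1, 2, 2]` for Sequence, `[0, 1, 2, 0, 1, 2]` for Cycle, and `[0, 1, 2, 2, 1, 0]` for Cycle (rev). Setting M = 1 recovers the all-layers-shared configuration that the abstract describes as the technique being relaxed.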

Tasks

Machine Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
WMT2014 English-German	Transformer Cycle (Rev)	BLEU score	35.14	—	Unverified
