
Multi-branch Attentive Transformer

2020-06-18 · Code Available

Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li, Tie-Yan Liu


Abstract

While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
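To make the architecture described in the abstract concrete, here is a minimal sketch of the multi-branch attentive layer with drop-branch, assuming PyTorch. The class name, argument names, default values, and the way dropped branches are normalized are illustrative assumptions, not the authors' implementation from the linked repository.

import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    """Sketch: the layer output is the average of several independent
    multi-head attention branches, with drop-branch regularization."""

    def __init__(self, embed_dim, num_heads, num_branches=2, drop_branch_p=0.2):
        super().__init__()
        # Each branch is an independent multi-head attention layer.
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(embed_dim, num_heads) for _ in range(num_branches)]
        )
        self.drop_branch_p = drop_branch_p

    def forward(self, query, key, value):
        outputs = []
        for branch in self.branches:
            out, _ = branch(query, key, value)
            # Drop-branch: randomly zero out an entire branch during training.
            if self.training and torch.rand(1).item() < self.drop_branch_p:
                out = torch.zeros_like(out)
            outputs.append(out)
        # Average over branches; the paper's exact rescaling for dropped
        # branches may differ from this plain mean (assumption).
        return torch.stack(outputs, dim=0).mean(dim=0)

# Example usage: self-attention over a (seq_len=10, batch=4, dim=512) input.
layer = MultiBranchAttention(embed_dim=512, num_heads=8, num_branches=3)
x = torch.randn(10, 4, 512)
y = layer(x, x, x)  # y has shape (10, 4, 512)

Proximal initialization, the second technique from the abstract, would amount to copying the attention weights of a pre-trained Transformer into every branch before training; that step is not shown in this sketch.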

Tasks

Machine Translation · Code Generation · Natural Language Understanding

Benchmark Results

Dataset                  | Model | Metric     | Claimed | Verified | Status
IWSLT2014 German-English | MAT   | BLEU score | 36.22   |          | Unverified
WMT2014 English-German   | MAT   | SacreBLEU  | 29.9    |          | Unverified

Reproductions