Weighted Transformer Network for Machine Translation

2017-11-06 · ICLR 2018 · Code Available

Karim Ahmed, Nitish Shirish Keskar, Richard Socher

Code

github.com/duyvuleo/Transformer-DyNet (tf, ★ 65)
github.com/bagequan/tencent-transformer-with-disagreement (none, ★ 5)
github.com/JayParks/transformer (pytorch, ★ 0)
github.com/Flawless1202/Transformer (pytorch, ★ 0)
github.com/xrick/PyTorch_Transformer (pytorch, ★ 0)

Abstract

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recurrence. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention with multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task.
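
The branch-combination idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the model learns to combine several self-attention branches, so the softmax-normalized mixing weights, the per-branch square projection matrices, and the names `weighted_branches`, `branch_projections`, and `branch_logits` are all assumptions made for illustration. Plain Python lists stand in for tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def project(xs, matrix):
    """Multiply each row vector in xs by matrix (a list of rows)."""
    cols = len(matrix[0])
    return [
        [sum(x[i] * matrix[i][j] for i in range(len(x))) for j in range(cols)]
        for x in xs
    ]


def weighted_branches(x, branch_projections, branch_logits):
    """Combine several self-attention branches with learned weights.

    Assumption: the mixing weights are a softmax over one scalar logit
    per branch, so they are non-negative and sum to 1. Each projection
    is assumed square so the output keeps the input width.
    """
    mix = softmax(branch_logits)
    combined = [[0.0] * len(x[0]) for _ in x]
    for weight, proj in zip(mix, branch_projections):
        h = project(x, proj)        # hypothetical per-branch projection
        out = attention(h, h, h)    # self-attention within the branch
        for t, row in enumerate(out):
            for d, val in enumerate(row):
                combined[t][d] += weight * val
    return combined


# Two tokens, two branches; in training the logits would be learned.
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
mixed = weighted_branches(tokens, [identity, doubled], [0.0, 0.0])
```

In a trainable version the entries of `branch_logits` would be parameters updated by gradient descent alongside the projections, which is one way to read "learns to combine during the training process".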

Tasks

Machine Translation, Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
WMT2014 English-French	Weighted Transformer (large)	BLEU score	41.4	—	Unverified
WMT2014 English-German	Weighted Transformer (large)	BLEU score	28.9	—	Unverified
