Attention Is All You Need

2017-06-12 · NeurIPS 2017 · Code Available

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin



Code

github.com/jadore801120/attention-is-all-you-need-pytorch · pytorch · ★ 9,663
github.com/going-doer/paper2code · none · ★ 4,271
github.com/shreyashankar/gpt3-sandbox · none · ★ 2,880
github.com/davisking/dlib-models · none · ★ 1,605
github.com/IBM/pytorch-seq2seq · pytorch · ★ 1,518
github.com/veekaybee/what_are_embeddings · tf · ★ 1,060
github.com/maxjcohen/transformer · pytorch · ★ 904
github.com/lukemelas/PyTorch-Pretrained-ViT · pytorch · ★ 853
github.com/vinairesearch/phogpt · pytorch · ★ 798
github.com/studio-ousia/luke · pytorch · ★ 727

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
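
The abstract names attention as the only building block but does not spell out the operation. As a rough illustration, the sketch below computes scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, which is the formula given in the body of the paper rather than in the abstract above. The function names, the list-of-lists representation, and the toy inputs are assumptions made for this example; it omits masking, multiple heads, and learned projections, and is not the authors' implementation.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for small list-of-list matrices.

    Hypothetical helper for illustration only: no masking, no batching.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1 / sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(dim_v)
        ])
    return outputs


# Toy inputs (made up): 2 queries, 3 key/value pairs, d_k = 2, d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Because every query row is handled independently of the others, the loop over queries could be replaced by a single matrix product, which is one way to read the abstract's claim that the model is more parallelizable than recurrent alternatives.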

Tasks

Abstractive Text Summarization, All, Coreference Resolution, Decoder, Few-Shot 3D Point Cloud Classification, Image-guided Story Ending Generation, LIDAR Semantic Segmentation, Link Prediction, Machine Translation, Multimodal Machine Translation, Natural Language Understanding, Question Answering, Speech Emotion Recognition, Supervised Only 3D Point Cloud Classification, Text Summarization, Translation
