Addressing Some Limitations of Transformers with Feedback Memory

2020-02-21

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar

Code

github.com/facebookresearch/transformer-sequential (official, PyTorch, ★ 143)
github.com/lucidrains/feedback-transformer-pytorch (PyTorch, ★ 108)
github.com/rajaswa/feedback-and-memory-in-transformers (PyTorch, ★ 17)

Abstract

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
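
The feedback idea in the abstract can be illustrated with a toy sketch. It assumes details the abstract does not spell out: that each timestep's layer outputs are merged into a single memory vector by a softmax-weighted sum over layers, and that every layer of later timesteps attends over that shared memory of past steps only. The names `feedback_forward`, `attend`, and `layer_logits` are hypothetical, and a `tanh` residual update stands in for the real attention and feedforward sublayers. It is not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Scaled dot-product attention of one query vector over past memory vectors."""
    if not memory:
        # First timestep: nothing to attend to yet.
        return [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, mem)) / scale for mem in memory]
    weights = softmax(scores)
    return [sum(w * mem[i] for w, mem in zip(weights, memory))
            for i in range(len(query))]

def feedback_forward(inputs, num_layers, layer_logits):
    """Process a sequence step by step with a shared feedback memory.

    inputs: list of d-dimensional vectors (toy token embeddings).
    layer_logits: num_layers + 1 scalars (assumed learnable in a real model)
                  used to mix the embedding and all layer outputs.
    """
    mix = softmax(layer_logits)
    memory = []   # one merged vector per past timestep
    outputs = []
    for x in inputs:              # sequential over time, not parallel
        states = [x]              # layer 0 is the input embedding
        h = x
        for _ in range(num_layers):
            # Every layer, including the lowest, reads the same memory,
            # which already contains top-layer information from the past.
            ctx = attend(h, memory)
            h = [math.tanh(a + b) for a, b in zip(h, ctx)]  # toy layer update
            states.append(h)
        # Merge all layers of this timestep into a single memory vector.
        merged = [sum(w * s[i] for w, s in zip(mix, states))
                  for i in range(len(x))]
        memory.append(merged)
        outputs.append(h)
    return outputs, memory
```

Because the memory for step t only exists once all layers of step t have run, the loop over time cannot be parallelized the way a standard Transformer's can; that is the trade-off the abstract alludes to.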

Tasks

Language Modeling, Machine Translation, Reinforcement Learning, Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
enwik8	Feedback Transformer	Bit per Character (BPC)	0.96	—	Unverified
Penn Treebank (Character Level)	Feedback Transformer	Bit per Character (BPC)	1.16	—	Unverified
WikiText-103	Feedback Transformer (8 layers)	Test perplexity	18.2	—	Unverified
WikiText-103	Feedback Transformer (4 layers)	Test perplexity	22.4	—	Unverified
