Adaptive Attention Span in Transformers

2019-05-19 · ACL 2019 · Code Available

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin


Code

github.com/facebookresearch/adaptive-span
Official · In paper · pytorch · ★ 0
github.com/jerrodparker20/adaptive-transformers-in-rl
pytorch · ★ 136
github.com/prajjwal1/fluence
pytorch · ★ 70
github.com/prajjwal1/adaptive_transformer
pytorch · ★ 43
github.com/JoeRoussy/adaptive-attention-in-cv
pytorch · ★ 35
github.com/lancopku/Explicit-Sparse-Transformer
tf · ★ 0
github.com/ofirpress/sandwich_transformer
pytorch · ★ 0
github.com/pwc-1/Paper-9/tree/main/7/Knowing-When-to-Look-Adaptive-Attention
mindspore · ★ 0

Abstract

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on character-level language modeling, where we achieve state-of-the-art performance on text8 and enwik8 with a maximum context of 8k characters.
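
As a rough illustration of the idea in the abstract, the sketch below applies a soft span mask of the form clamp((R + z - x) / R, 0, 1) to attention weights, where x is the distance to a past token, z is the span, and R is the width of the linear ramp. This is a minimal pure-Python sketch, not the official implementation: the function names are hypothetical, and the span is passed in as a fixed number here, whereas the method learns it during training.

```python
import math


def soft_span_mask(distance, span, ramp):
    """Soft mask: 1 inside the span, linear decay over `ramp`, then 0."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def masked_attention(scores, span, ramp):
    """Softmax over past positions, reweighted by the soft span mask.

    scores[i] is the raw attention score for the token at distance i
    (an indexing convention assumed for this sketch).
    """
    peak = max(scores)  # subtract the max for numerical stability
    weights = [
        soft_span_mask(i, span, ramp) * math.exp(s - peak)
        for i, s in enumerate(scores)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Eight past tokens with equal scores; span and ramp are illustrative values.
attn = masked_attention([0.0] * 8, span=2.0, ramp=2.0)
```

With these illustrative values, tokens at distance 4 or more receive zero weight, so a head with a short span never needs to attend over the full context.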

Tasks

8k, Language Modelling

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
enwik8	Transformer (24 layers, 8k adaptive span)	Bit per Character (BPC)	0.98	—	Unverified
enwik8	Transformer (12 layers, 8k adaptive span)	Bit per Character (BPC)	1.02	—	Unverified
Text8	24L Transformer + 8K adaptive span	Bit per Character (BPC)	1.07	—	Unverified
Text8	12L Transformer + 8K adaptive span	Bit per Character (BPC)	1.11	—	Unverified
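
For readers unfamiliar with the metric in the table, bits per character is the average negative log-likelihood per character expressed in base 2. The snippet below is a small sketch of that conversion from a loss summed in nats; the function name and the input values are hypothetical and are not taken from the paper.

```python
import math


def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    return total_nll_nats / (num_characters * math.log(2))


# Illustrative input: a model that assigns probability 0.5 to each of
# 1000 characters has a summed loss of 1000 * ln(2) nats, i.e. 1 BPC.
bpc = bits_per_character(total_nll_nats=1000 * math.log(2), num_characters=1000)
```

Lower is better: a model at 1.0 BPC spends on average one bit to encode each character.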
