
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

2021-07-25 · ACL 2021 · Code Available

Zhenhai Zhu, Radu Soricut

Abstract

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical for natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
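
The hierarchical attention described in the abstract can be pictured as exact attention within small diagonal blocks (local context) combined with attention against block-averaged summaries for distant positions (global context), which is what keeps run time and memory linear in the sequence length. The snippet below is a minimal NumPy sketch of that idea, reduced to two levels; the function name `two_level_hierarchical_attention` and the fixed 0.5/0.5 mixing of the two levels are illustrative assumptions, not the paper's algorithm, which uses a full multi-level hierarchy with principled weighting.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_hierarchical_attention(q, k, v, block=16):
    """Illustrative two-level hierarchical attention (not the paper's exact method).

    Fine level: exact attention inside each diagonal block (local context).
    Coarse level: attention against block-averaged keys/values (global context).
    Both levels cost O(n * block) time and memory instead of O(n^2).
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    scale = 1.0 / np.sqrt(d)

    # Fine level: block-diagonal attention.
    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, d)
    local_scores = np.einsum("bqd,bkd->bqk", qb, kb) * scale      # (nb, block, block)
    local_out = np.einsum("bqk,bkd->bqd", softmax(local_scores), vb)

    # Coarse level: every query attends to block-averaged keys/values.
    k_coarse = kb.mean(axis=1)                                    # (nb, d)
    v_coarse = vb.mean(axis=1)                                    # (nb, d)
    coarse_scores = (q @ k_coarse.T) * scale                      # (n, nb)
    coarse_out = softmax(coarse_scores) @ v_coarse                # (n, d)

    # Fixed mixing of the two levels; the paper derives the weights instead.
    return 0.5 * local_out.reshape(n, d) + 0.5 * coarse_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 128, 64
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    out = two_level_hierarchical_attention(q, k, v, block=16)
    print(out.shape)  # (128, 64)
```

Extending the sketch to more levels (repeatedly averaging blocks into coarser blocks) is what yields the H-Matrix-like structure with overall linear complexity.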

Tasks

Benchmark Results

Dataset | Model | Metric | Claimed | Verified | Status
One Billion Word | H-Transformer-1D Nr=16 (Base) | Validation perplexity | 23.95 | – | Unverified
One Billion Word | H-Transformer-1D Nr=16 (Large) | Validation perplexity | 20.25 | – | Unverified

Reproductions