Hyena Hierarchy: Towards Larger Convolutional Language Models

2023-02-21

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Code

Repository	Framework	Stars	Notes
github.com/hazyresearch/safari	pytorch	911	Official, in paper
github.com/togethercomputer/stripedhyena	pytorch	417	
github.com/lindermanlab/S5	jax	317	
github.com/i404788/s5-pytorch	jax	82	
github.com/Suro-One/Hyena-Hierarchy	pytorch	46	
github.com/expz/annotated-hyena	none	34	
github.com/MindSpore-scientific-2/code-4/tree/main/Hyena-A-Convolutional-Neural-Network-for-Modelling-Sentences	mindspore	0	

Abstract

Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
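
The operator described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the recurrence z <- x_n * (h_n conv z), that is, a causal long convolution followed by an elementwise gate, repeated once per stage. The names `fft`, `causal_conv`, and `hyena_like` are hypothetical. In the actual model the filters are implicitly parametrized (generated by a small network) and the gates are projections of the input; here both are passed in as plain lists to keep the example self-contained. The FFT is what brings the convolution cost down from O(L^2) to O(L log L).

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized). len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def causal_conv(h, u):
    """Causal convolution y[t] = sum_{s<=t} h[s] * u[t-s] in O(L log L).

    Assumes len(h) <= len(u). Zero-padding to at least 2L avoids the
    wrap-around of circular convolution.
    """
    L = len(u)
    n = 1
    while n < 2 * L:
        n *= 2
    H = fft(list(h) + [0.0] * (n - len(h)))
    U = fft(list(u) + [0.0] * (n - L))
    y = fft([a * b for a, b in zip(H, U)], inverse=True)
    return [(v / n).real for v in y[:L]]  # divide by n to normalize the inverse


def hyena_like(v, gates, filters):
    """Interleave long convolutions with elementwise (data-controlled) gates.

    v:       value sequence of length L
    gates:   list of N gate sequences, each of length L
    filters: list of N filters, each of length <= L
    """
    z = list(v)
    for x, h in zip(gates, filters):
        conv = causal_conv(h, z)
        z = [xi * ci for xi, ci in zip(x, conv)]
    return z


if __name__ == "__main__":
    h = [1.0, 0.5, 0.25, 0.125]   # hypothetical decaying filter
    v = [1.0, 2.0, 3.0, 4.0]      # hypothetical value sequence
    x = [1.0, 0.0, 1.0, 2.0]      # hypothetical gate sequence
    print(hyena_like(v, [x], [h]))
```

Because the gate multiplies the convolution output position by position, the effective mixing depends on the input, while every stage stays subquadratic in sequence length.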

Tasks

2k, 8k, Language Modeling, Question Answering

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
WikiText-103	Hyena-3-slim	Test perplexity	18.5	—	Unverified
WikiText-103	Hyena-3	Test perplexity	18.6	—	Unverified
