
RealFormer: Transformer Likes Residual Attention

2020-12-21 · Findings (ACL) 2021 · Code Available

Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie


Abstract

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
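The core idea, residual attention, adds a skip connection over the raw (pre-softmax) attention scores: each layer adds the previous layer's scores to its own before applying softmax, while the rest of the Transformer block is unchanged. Below is a minimal single-head NumPy sketch of that mechanism; it illustrates the idea rather than reproducing the authors' implementation (see the linked repository for that), and names such as `residual_attention` and `prev_scores` are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head scaled dot-product attention with a RealFormer-style
    residual connection over the raw (pre-softmax) attention scores.

    q, k, v:      [seq_len, d_head] arrays for one head of one layer.
    prev_scores:  raw scores from the same head in the previous layer,
                  or None in the first layer.
    Returns (output, scores); the caller threads `scores` into the
    next layer as its `prev_scores`.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # raw attention logits, [seq, seq]
    if prev_scores is not None:
        scores = scores + prev_scores  # residual "edge" over the scores
    probs = softmax(scores)
    return probs @ v, scores

# Threading scores through a stack of layers (toy shapes;
# query/key/value projections omitted for brevity):
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))           # 8 tokens, d_head = 16
prev = None
for _ in range(4):                     # 4 attention layers
    x, prev = residual_attention(x, x, x, prev_scores=prev)
```

In the full model this residual runs per head across all layers, and the remaining components (projections, feed-forward sublayers, layer normalization) are identical to a standard Transformer.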

Tasks

Masked Language Modeling · GLUE · SQuAD · Neural Machine Translation · WikiHop · HotpotQA · Natural Questions · OpenKP

Benchmark Results

Dataset | Model      | Metric               | Claimed | Verified | Status
CoLA    | RealFormer | Matthews corr. (MCC) | 59.83   | n/a      | Unverified

Reproductions

No reproductions have been submitted yet.