
TACO: Pre-training of Deep Transformers with Attention Convolution using Disentangled Positional Representation

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

Word order, a crucial element of natural language understanding, has been carefully considered in pre-trained models through various kinds of positional encodings. However, existing pre-trained models mostly lack robustness against minor permutations of words in their learned representations. We therefore propose a novel architecture named Transformer with Attention COnvolution (TACO), which explicitly disentangles positional representations and applies convolution over multi-source attention maps before the softmax in self-attention. Additionally, we design a novel self-supervised task, masked position modeling (MPM), to help TACO capture complex patterns related to word order. By combining the MLM (masked language modeling) and MPM objectives, the proposed TACO model efficiently learns two disentangled vectors for each token, representing its content and position respectively. Experimental results show that TACO significantly outperforms BERT on various downstream tasks with fewer model parameters. Remarkably, TACO achieves a +2.6% improvement over BERT on SQuAD 1.1, +5.4% on SQuAD 2.0, and +3.4% on RACE, with only 46K pre-training steps.
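The abstract's core mechanism, disentangled content/position attention maps combined and convolved before the softmax, can be sketched in a toy form. This is a hedged illustration only: the function names, the choice of a 1-D smoothing kernel along the key axis, and the simple sum of content-content and position-position logits are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_convolution(Qc, Kc, Qp, Kp, V, kernel):
    """Toy sketch (not the paper's exact method): build a multi-source
    attention map from disentangled content (Qc, Kc) and position (Qp, Kp)
    projections, convolve each row with a small 1-D kernel BEFORE softmax,
    then attend over the values V."""
    d = Qc.shape[-1]
    # combine content-content and position-position logits (an assumed mix)
    logits = (Qc @ Kc.T + Qp @ Kp.T) / np.sqrt(d)
    # 1-D convolution along the key axis; 'same' padding keeps the shape
    conv = np.stack([np.convolve(row, kernel, mode="same") for row in logits])
    weights = softmax(conv, axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sequence length and head dimension
Qc, Kc, Qp, Kp, V = (rng.standard_normal((n, d)) for _ in range(5))
out = attention_convolution(Qc, Kc, Qp, Kp, V,
                            kernel=np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (4, 8)
```

Convolving the logits rather than the post-softmax weights lets neighboring key positions share evidence while the output still normalizes to a proper attention distribution.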
