CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

2021-03-11

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Code

github.com/google-research/language/tree/master/language/canine (official, TensorFlow, ★ 0)
github.com/octanove/shiba (PyTorch, ★ 89)
github.com/pwc-1/Paper-8/tree/main/canine (MindSpore, ★ 0)
github.com/2024-MindSpore-1/Code2/tree/main/model-1/canine (MindSpore, ★ 0)
github.com/kevinng77/canine_paddle (Paddle, ★ 0)

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
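
The abstract's two central ideas, reading raw characters without a vocabulary and downsampling the resulting long sequence before the deep transformer stack, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy embedding, the mean-pooling operation, and the rate of 4 are all assumptions chosen for illustration, and the abstract does not specify how CANINE actually performs downsampling.

```python
# Illustrative sketch only; names, the toy embedding, and mean pooling are
# assumptions and do not come from the CANINE paper or its code.

def to_codepoints(text):
    """Map a string to Unicode code points, so no vocabulary is needed."""
    return [ord(ch) for ch in text]


def toy_embed(codepoint, dim=4):
    """Hypothetical stand-in for a learned character embedding."""
    return [((codepoint * (i + 1)) % 97) / 97 for i in range(dim)]


def downsample(vectors, rate):
    """Mean-pool each run of `rate` consecutive vectors into one vector.

    The last group may be shorter than `rate` and is averaged as-is.
    """
    pooled = []
    for start in range(0, len(vectors), rate):
        group = vectors[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


if __name__ == "__main__":
    chars = to_codepoints("tokenization-free")
    embedded = [toy_embed(cp) for cp in chars]
    shorter = downsample(embedded, rate=4)  # rate=4 is an assumed value
    print(len(embedded), "character positions ->", len(shorter), "downsampled positions")
```

In a real model the shorter sequence would be what the deep transformer stack processes, which is how downsampling keeps character-level input affordable.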

Tasks

Inductive Bias
