
Semantic Tokenizer for Enhanced Natural Language Processing

2021-10-16 · ACL ARR October 2021

Anonymous


Abstract

Traditionally, efforts to improve NLP performance have focused on better models and larger parameter counts; little attention has been paid to vocabulary optimization. We present a novel tokenizer that uses semantics to drive subword formation. The tokenizer includes a trainer that uses stemming to enhance subword formation, along with further optimizations and adaptations that minimize the number of words that cannot be encoded. The encoder is updated to integrate with the trainer, and the tokenizer is implemented as a drop-in replacement for the SentencePiece tokenizer. The new tokenizer more than doubles the number of wordforms represented in the vocabulary, and the enhanced vocabulary significantly improves model convergence and the quality of word and sentence embeddings. Our experimental results show top performance on two GLUE tasks using BERT-base, improving on models more than 20 times its size.
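The core idea of stemming-driven subword formation, letting a stemmer propose morphological split points so that many inflected wordforms share a single stem entry in the vocabulary, could be sketched as follows. This is a minimal illustration under stated assumptions: the suffix list, the `stem_split` helper, and the `##` continuation marker are hypothetical choices for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch: stemming-guided subword segmentation.
# Unlike purely frequency-driven merges (as in BPE/SentencePiece),
# each word is split at a morphological boundary proposed by a toy
# suffix-stripping stemmer, so one stem covers many wordforms.

# Toy suffix inventory, ordered longest-first so the greedy match
# prefers the most specific suffix (illustrative, not exhaustive).
COMMON_SUFFIXES = ["ization", "ation", "ingly", "ness",
                   "ing", "ers", "ed", "er", "es", "ly", "s"]

def stem_split(word: str, min_stem: int = 3):
    """Split `word` into (stem, suffix) at the longest known suffix,
    keeping at least `min_stem` characters in the stem."""
    for suf in COMMON_SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)], "##" + suf
    return word, None  # no known suffix: keep the word whole

def tokenize(text: str):
    """Tokenize whitespace-split words into stem + suffix subwords."""
    out = []
    for word in text.lower().split():
        stem, suffix = stem_split(word)
        out.append(stem)
        if suffix:
            out.append(suffix)
    return out

# Three inflections of the same lemma collapse onto one stem token.
print(tokenize("tokenizers tokenizing tokenized"))
```

Because all three wordforms map to the shared stem `tokeniz` plus a short suffix token, a fixed-size vocabulary can represent far more surface forms, which is the intuition behind the paper's claim of more than doubling the wordforms covered.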
