Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi'kmaq Language Modelling

2020-05-01 · LREC 2020

Jeremie Boudreau, Akankshya Patra, Ashima Suvarna, Paul Cook

Abstract

Mi'kmaq is an Indigenous language spoken primarily in Eastern Canada. It is polysynthetic and low-resource. In this paper we consider a range of n-gram and RNN language models for Mi'kmaq. We find that an RNN language model initialized with pre-trained fastText embeddings performs best, highlighting the importance of sub-word information for Mi'kmaq language modelling. We further consider approaches to language modelling that incorporate cross-lingual word embeddings, but do not see improvements with these models. Finally, we consider language models that operate over segmentations produced by SentencePiece (which include sub-word units as tokens), as opposed to word-level models. This approach improves over word-level language models, again indicating that sub-word modelling is important for Mi'kmaq language modelling.
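
As a rough illustration of what "sub-word information" means for fastText-style embeddings, the sketch below extracts boundary-marked character n-grams from a word and shows how two morphologically related forms share many of them. This is not the authors' code: the function names `char_ngrams` and `shared_ngrams` are hypothetical, the English example words are chosen only for readability, and the default range of 3 to 6 characters is assumed to mirror fastText's usual `minn`/`maxn` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers.

    Hypothetical helper: mimics the kind of sub-word units fastText sums over,
    but does not reproduce fastText's hashing or training.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


if __name__ == "__main__":
    # Two inflected forms of the same stem overlap in their stem n-grams,
    # so a rare form can borrow signal from a frequent one.
    print(sorted(shared_ngrams("walking", "walked", 3, 4)))
```

In a polysynthetic language, where a single word can carry many morphemes and most word forms are rare, this kind of overlap is one plausible reason sub-word units help: unseen or infrequent word forms still share units with forms observed in training.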

Tasks

Cross-Lingual Word Embeddings, Language Modeling, Language Modelling, Word Embeddings
