
Morphologically-Guided Segmentation For Translation of Agglutinative Low-Resource Languages

2021-08-01 · MTSummit 2021 · Code Available

William Chen, Brett Fazio


Abstract

Neural Machine Translation (NMT) for Low Resource Languages (LRL) is often limited by the lack of available training data, making it necessary to explore additional techniques to improve translation quality. We propose the use of the Prefix-Root-Postfix-Encoding (PRPE) subword segmentation algorithm to improve translation quality for LRLs, using two agglutinative languages as case studies: Quechua and Indonesian. During the course of our experiments, we reintroduce a parallel corpus for Quechua-Spanish translation that was previously unavailable for NMT. Our experiments show the importance of appropriate subword segmentation, which can go as far as improving translation quality over systems trained on much larger quantities of data. We show this by achieving state-of-the-art results for both languages, obtaining higher BLEU scores than large pre-trained models with much smaller amounts of data.
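To make the idea of prefix-root-postfix segmentation concrete, the following is a minimal sketch and not the PRPE algorithm itself, whose details are not given in the abstract. It assumes small hand-picked affix lists (hypothetical, loosely based on Indonesian), a hypothetical minimum root length, and a "+" marker convention for affixes. The function name `segment` and all constants are illustrative assumptions.

```python
# Toy affix lists: hypothetical and hand-picked for illustration only.
# A real system would need a principled way to obtain its affix inventory.
PREFIXES = ("ber", "di", "me")
POSTFIXES = ("nya", "kan", "an")
MIN_ROOT_LEN = 4  # assumed guard so short roots are not split further


def segment(word, prefixes=PREFIXES, postfixes=POSTFIXES,
            min_root_len=MIN_ROOT_LEN):
    """Split a word into optional prefix, root, and optional postfix.

    At most one prefix and one postfix are stripped, and an affix is only
    removed if the remaining root keeps at least min_root_len characters.
    """
    before, after = [], []
    root = word

    # Try longer prefixes first so the most specific match wins.
    for p in sorted(prefixes, key=len, reverse=True):
        if root.startswith(p) and len(root) - len(p) >= min_root_len:
            before.append(p + "+")
            root = root[len(p):]
            break

    # Same strategy for postfixes, applied to what is left of the word.
    for s in sorted(postfixes, key=len, reverse=True):
        if root.endswith(s) and len(root) - len(s) >= min_root_len:
            after.append("+" + s)
            root = root[:-len(s)]
            break

    return before + [root] + after


if __name__ == "__main__":
    for w in ("dimakan", "makanan", "bermain", "dimakannya"):
        print(w, "->", " ".join(segment(w)))
```

The minimum root length is what keeps "dimakan" from being split into "di+ mak +an": the postfix match is rejected because the remaining root would be too short. Segmenting along morpheme boundaries in this way lets words that share a root also share a subword unit, which matters when training data is scarce.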
