Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining

2021-06-01 · NAACL (AmericasNLP) 2021

Francis Zheng, Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

Abstract

This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.
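The abstract reports improvements in both BLEU and chrF, the latter a character n-gram F-score that is often more informative than BLEU for morphologically rich, low-resource languages. As a rough illustration of what chrF measures (not the shared task's official scorer), here is a minimal sentence-level sketch following the standard definition with character n-grams up to order 6 and β = 2:

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams; whitespace is removed, as is common for chrF."""
    s = text.replace(" ", "")
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-1 scale (illustrative sketch only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0 and r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Identical strings score 1.0 and disjoint strings score 0.0; a reported gain of 0.0749 corresponds to about 7.5 chrF points on the 0-100 scale that some scorers use.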
