
Sequence-to-Sequence Multilingual Pre-Trained Models: A Hope for Low-Resource Language Translation?

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

We investigate the capability of mBART, a sequence-to-sequence multilingual pre-trained model, to translate low-resource languages under five factors: the amount of data used to pre-train the original model, the amount of fine-tuning data, the noisiness of the fine-tuning data, the domain relatedness among the pre-training, fine-tuning, and test datasets, and language relatedness. When only limited parallel corpora are available, fine-tuning mBART measurably improves translation performance over training a Transformer from scratch. mBART makes effective use even of domain-mismatched text, suggesting that it can learn meaningful representations when data is scarce. Still, it falters when given too little data in languages it has not seen during pre-training.
