
Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task

2021-08-01 · ACL (WAT) 2021

Adam Dobrowolski, Marcin Szymański, Marcin Chochowski, Paweł Przybysz


Abstract

This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and finally ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and fine-tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach to finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over baseline results. The quality of the models was confirmed by human evaluation, in which the SRPOL models scored best for all 5 manually evaluated languages.
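The abstract mentions searching for the best hyperparameters when ensembling several translation models. One common form of NMT ensembling is a weighted log-linear combination of each model's token probabilities, with the weights tuned on a development set. The sketch below illustrates that idea only; the toy per-model probabilities, the two-model setup, and the grid-search scoring function are all illustrative assumptions, not the paper's actual method.

```python
import math

# Toy stand-in for per-model token log-probabilities over a 3-token
# vocabulary; real NMT models would supply these at each decoding step.
# (Hypothetical values, for illustration only.)
model_logprobs = [
    {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)},
    {"a": math.log(0.2), "b": math.log(0.7), "c": math.log(0.1)},
]

def ensemble_logprob(token, weights):
    """Weighted log-linear combination of per-model log-probabilities."""
    return sum(w * m[token] for w, m in zip(weights, model_logprobs))

def grid_search(dev_score, step=0.1):
    """Toy grid search over 2-model ensemble weights (w, 1-w),
    keeping the weights that maximise a dev-set score."""
    best_w, best_s = None, float("-inf")
    for i in range(int(1 / step) + 1):
        weights = (i * step, 1.0 - i * step)
        s = dev_score(weights)
        if s > best_s:
            best_w, best_s = weights, s
    return best_w, best_s

# Toy dev "score": log-probability the ensemble assigns to a
# reference token "b" (a real search would score BLEU on a dev set).
score = lambda w: ensemble_logprob("b", w)
weights, s = grid_search(score)
```

In this toy setting the search settles on the model that prefers the reference token; in practice the search space is larger (one weight per model, plus decoding parameters such as beam size and length penalty) and the objective is a corpus-level metric like BLEU.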
