Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair

2022-09-01 · CLIB 2022

Iglika Nikolova-Stoupak, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi


Abstract

One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora for rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus used to train machine translation models in this study is CCMatrix, provided by OPUS. First, the corpus is cleaned using a number of heuristic rules. Then, parts of it are selected in three distinct ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived by applying a state-of-the-art classifier model (Acarcicek et al., 2020) used in a thematic WMT shared task. The performance of the resulting models is evaluated and compared. The classifier-based model underperforms its margin-based counterpart, which opens a discussion of possible improvements. Still, BLEU scores surpass those reported in Acarcicek et al. (2020) by over 15 points.
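The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Pair` type, the function names, and the cleaning rules in `passes_heuristics` (non-empty sides, no identical source and target, a character-length ratio cap) are hypothetical stand-ins, since the abstract does not list the actual rules. The sketch also assumes that each sentence pair carries a single numeric score (a margin score or a classifier score) and that a higher score indicates a better pair.

```python
import random
from typing import List, NamedTuple


class Pair(NamedTuple):
    src: str      # Japanese sentence
    tgt: str      # Bulgarian sentence
    score: float  # assumed per-pair score (margin- or classifier-based)


def passes_heuristics(pair: Pair, max_len_ratio: float = 3.0) -> bool:
    """Hypothetical cleaning rules; the paper's own rules are not given here."""
    src, tgt = pair.src.strip(), pair.tgt.strip()
    if not src or not tgt:
        return False  # drop pairs with an empty side
    if src == tgt:
        return False  # drop untranslated copies
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_len_ratio  # drop badly mismatched lengths


def select_random(pairs: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Baseline: pick k pairs uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def select_top_scored(pairs: List[Pair], k: int) -> List[Pair]:
    """Pick the k highest-scoring pairs, assuming higher means better."""
    return sorted(pairs, key=lambda p: p.score, reverse=True)[:k]


corpus = [
    Pair("こんにちは", "Здравей", 1.20),
    Pair("", "празно", 1.50),
    Pair("猫", "Това е много дълго изречение за котка", 1.10),
    Pair("ありがとう", "Благодаря", 1.05),
]
cleaned = [p for p in corpus if passes_heuristics(p)]
best = select_top_scored(cleaned, 1)
```

The same `select_top_scored` function serves both score-based strategies; only the source of the `score` field changes, which keeps the comparison between the margin-based and classifier-based subsets limited to the scoring method.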

Tasks

Machine Translation, Translation
