An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages

2020-07-01 · WS 2020

Aquia Richburg, Ramy Eskander, Smaranda Muresan, Marine Carpuat

Abstract

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.
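
For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of the greedy merge-learning idea behind it. It is a toy illustration on made-up word frequencies, not the authors' experimental setup, and it does not cover the unigram language model segmentation. The function names `learn_bpe`, `merge_pair`, and `segment` are hypothetical and chosen only for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Start from characters, plus an end-of-word marker on each word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus with invented frequencies.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_freqs, 3)
print(toy_merges)
print(segment("lowest", toy_merges))
```

In this sketch a word that never appeared in the toy corpus, such as "lowest", is still segmented into known subword units, which is the property that makes subword segmentation relevant to translating rare source words.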
