
High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task

2021-11-01 · WMT (EMNLP) 2021

Bardia Rafieian, Marta R. Costa-jussà


Abstract

This paper reports on optimizing the use of out-of-domain data in the biomedical translation task. We first optimized our parallel training dataset using BabelNet in-domain terminology. Then, to enlarge the training set, we studied the effects of out-of-domain data on the biomedical translation task, built a mixture of in-domain and out-of-domain training data, and added further in-domain data via forward translation for the English-Spanish task. Finally, using a simple BPE optimization method, we increased the number of in-domain subwords in the mixed training set and trained a Transformer model on the generated data. Results show improvements with our proposed method.
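The abstract does not detail the BPE optimization itself, but the underlying intuition — upweighting in-domain text shifts the learned merges so that in-domain terms segment into fewer subwords — can be illustrated with a minimal Sennrich-style BPE sketch. The toy corpora and the term `glucose` below are illustrative assumptions, not the paper's actual data:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a (frequency-weighted) word list."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

def segment(word, merges):
    """Apply the learned merges in order to split a word into subwords."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpora: a general corpus, and the same corpus upweighted with an
# assumed in-domain biomedical term.
general = ["the", "cat", "sat"] * 5
mixed = general + ["glucose"] * 20

merges_general = learn_bpe(general, 10)
merges_mixed = learn_bpe(mixed, 10)

print(segment("glucose", merges_general))  # falls back to character-level pieces
print(segment("glucose", merges_mixed))    # merged into a single in-domain unit
```

With the general-only merges the unseen term shatters into characters, while the upweighted corpus yields merges that keep it whole — the same effect the paper pursues by increasing in-domain subwords in the mixed training set.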
