Training Data Augmentation for Code-Mixed Translation

2021-06-01 · NAACL 2021

Abhirut Gupta, Aditya Vavre, Sunita Sarawagi


Code

github.com/shruikan20/spoken-tutorial-dataset (official)

Abstract

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for such models by designing a strategy for converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
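
As a rough illustration of how per-token labels from a three-way sequence labeler could turn a monolingual source sentence into a code-mixed one, here is a minimal sketch. The abstract does not say what the three labels are, so the label set (`KEEP`, `SWITCH`, `DROP`), the `apply_labels` helper, and the toy `lexicon` are all hypothetical placeholders rather than the authors' procedure. In the paper the labels would come from the m-BERT based model; here they are written by hand.

```python
from typing import Dict, List

# Hypothetical label set: the abstract only says the labeler is ternary.
KEEP, SWITCH, DROP = 0, 1, 2


def apply_labels(tokens: List[str], labels: List[int],
                 lexicon: Dict[str, str]) -> List[str]:
    """Build a code-mixed token sequence from per-token labels.

    KEEP   -> leave the source token unchanged
    SWITCH -> replace it with its English counterpart (toy lexicon lookup)
    DROP   -> omit the token
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    mixed = []
    for token, label in zip(tokens, labels):
        if label == KEEP:
            mixed.append(token)
        elif label == SWITCH:
            # Fall back to the source token when no English entry exists.
            mixed.append(lexicon.get(token, token))
        elif label == DROP:
            continue
        else:
            raise ValueError(f"unknown label: {label}")
    return mixed


# Hand-written labels stand in for model predictions (an assumption).
tokens = ["mujhe", "yeh", "kitaab", "bahut", "pasand", "hai"]
labels = [KEEP, KEEP, SWITCH, KEEP, KEEP, KEEP]
lexicon = {"kitaab": "book"}

mixed = apply_labels(tokens, labels, lexicon)
print(" ".join(mixed))  # mujhe yeh book bahut pasand hai
```

Pairing each generated code-mixed sentence with the English side of the original parallel pair would give the kind of synthetic code-mixed parallel data the abstract describes.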

Tasks

Data Augmentation, Machine Translation, Translation
