Unified NMT models for the Indian subcontinent transcending script-barriers

2021-11-16 · ACL ARR November 2021

Anonymous

Abstract

Highly accurate machine translation systems are essential in societies and countries where multilingualism is common and English often does not suffice. The Indian subcontinent is such a region, yet the Indic languages remain under-represented in the NLP ecosystem. It is essential to advance the state of the art for such low-resource languages, at least by using the data available in open source, which itself has not been explored much in the Indic ecosystem. In our work, we focus on improving the performance of very-low-resource Indic languages, especially those of countries besides India. Specifically, we propose how unified models can be built that exploit data from comparatively resource-rich languages of the same region. We propose strategies to unify different types of previously unexplored scripts, especially Perso-Arabic and Indic scripts, to build multilingual models for all the Indic languages despite the script barrier. We also study how augmentation techniques like back-translation can be used to build unified models that achieve state-of-the-art results among open-source models, notably using only openly available raw data.
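The abstract does not say how the scripts are unified. As a rough illustration of one common approach (an assumption here, not necessarily the authors' method), the sketch below maps text from several Brahmi-derived scripts onto Devanagari. It relies on the fact that the Unicode blocks for these scripts share a largely parallel layout. The function name `to_devanagari` and the table `SCRIPT_BLOCK_START` are hypothetical. Perso-Arabic scripts do not follow this layout, so they would need a separate rule-based or learned transliteration step that is not shown.

```python
# Illustrative sketch only: offset-based script unification for Brahmi-derived
# scripts. This is NOT taken from the paper; names below are hypothetical.

DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these Unicode blocks spans 128 code points

# Start code points of the Unicode blocks for several Indic scripts.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text: str, script: str) -> str:
    """Shift characters from the given script's block into the Devanagari block.

    Characters outside the source block (spaces, digits, Latin text) are kept.
    The mapping is approximate: some code points have no Devanagari counterpart.
    """
    start = SCRIPT_BLOCK_START[script]
    offset = DEVANAGARI_START - start
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            out.append(chr(cp + offset))
        else:
            out.append(ch)
    return "".join(out)
```

With a shared script, related languages can share subword vocabulary in a single multilingual model, which is the kind of cross-lingual transfer the abstract aims for.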
