Multilingual Multi-Domain NMT for Indian Languages
Sourav Kumar, Salil Aggarwal, Dipti Sharma
Abstract
India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach for machine translation (MT), but it performs well only with large datasets, which Indian languages usually lack, making this approach infeasible. In this paper, we address the problem of data scarcity by efficiently training multilingual and multilingual multi-domain NMT systems involving languages of the Indian subcontinent. We propose a technique for using joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) training a multilingual system by exploiting lexical similarity based on language family helps achieve an overall average improvement of 3.25 BLEU points over bilingual baselines; (ii) incorporating domain information into the language tokens helps the multilingual multi-domain system obtain a significant average improvement of 6 BLEU points over the baselines; (iii) multistage fine-tuning further yields an improvement of 1-1.5 BLEU points for the language pair of interest.
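The joint domain-and-language tagging described in the abstract is commonly realized by prepending a single control token, combining the target language and the domain, to each source sentence before training. The sketch below illustrates this preprocessing step; the token format, function name, and example corpus are assumptions for illustration, not the authors' exact scheme.

```python
# Minimal sketch of joint language+domain tagging for multilingual
# multi-domain NMT. The "<2{lang}_{domain}>" token format is an
# illustrative assumption; the paper's exact tokens may differ.

def tag_source(sentence: str, tgt_lang: str, domain: str) -> str:
    """Prepend a joint target-language/domain control token to a source sentence."""
    return f"<2{tgt_lang}_{domain}> {sentence}"

# Hypothetical mixed-domain, multi-target training samples.
corpus = [
    ("india is the land of many tongues", "hi", "general"),
    ("the patient was given the prescribed dose", "te", "medical"),
]

tagged = [tag_source(src, lang, dom) for src, lang, dom in corpus]
for line in tagged:
    print(line)
```

Because the tag is just another vocabulary item, a single shared encoder-decoder can learn to condition on both the output language and the domain, and new language-domain combinations require only a new token rather than a new model.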