
Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi

2021-10-16 · ACL ARR October 2021

Anonymous


Abstract

Word embeddings are a crucial resource in NLP for any language. This work focuses on transferring static subword embeddings from a relatively higher-resource language to a genealogically related low-resource Indian language. We work with Hindi-Marathi as our language pair, simulating a low-resource scenario for Marathi. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both the source and target sides over the subword treatment performed by FastText. We show that a trivial "copy-and-paste" embeddings transfer, even one based on a perfect bilingual lexicon, is inadequate for capturing language-specific relationships. Our best-performing method uses an EM-style procedure to learn bilingual subword embeddings; the resulting embeddings are evaluated on the publicly available Marathi Word Similarity task as well as WordNet-Based Synonymy Tests. We find that our approach significantly outperforms the FastText baseline on both tasks; on the former, its performance approaches that of pretrained FastText Marathi embeddings trained on two orders of magnitude more Marathi data.
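The "copy-and-paste" baseline mentioned in the abstract can be sketched as follows: for each target-language (Marathi) word with an entry in a bilingual lexicon, copy over the source-language (Hindi) vector of its translation. This is a minimal illustrative sketch, not the paper's implementation; the function name, data structures, and toy romanized vocabulary are all assumptions.

```python
# Hedged sketch of a "copy-and-paste" embedding transfer via a bilingual
# lexicon. All names and the toy data below are illustrative, not from
# the paper.

def copy_paste_transfer(src_emb, lexicon):
    """Initialize target embeddings by copying source vectors.

    src_emb: dict mapping source (Hindi) word -> embedding vector.
    lexicon: dict mapping target (Marathi) word -> source (Hindi) word.
    Returns a dict mapping target word -> copied embedding vector.
    """
    tgt_emb = {}
    for tgt_word, src_word in lexicon.items():
        if src_word in src_emb:
            # Copy the vector so the target embedding is not aliased
            # to the source one.
            tgt_emb[tgt_word] = list(src_emb[src_word])
    return tgt_emb

# Toy example with romanized placeholder words and 2-d vectors.
hindi = {"paanii": [0.1, 0.9], "ghar": [0.8, 0.2]}
lexicon = {"paani": "paanii", "ghar": "ghar"}
marathi = copy_paste_transfer(hindi, lexicon)
```

As the abstract notes, this baseline gives every target word exactly the geometry of its source translation, so it cannot capture relationships specific to the target language; that limitation motivates the EM-style bilingual subword learning used in the paper's best-performing method.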
