Aligning Word Vectors on Low-Resource Languages with Wiktionary
2022-10-01 · LoResMT (COLING) 2022
Mike Izbicki
Code
- github.com/mikeizbicki/wiktionary_bli (official, linked in paper)
Abstract
Aligned word embeddings have become a popular technique for low-resource natural language processing. Most existing evaluation datasets are generated automatically from machine translation systems, so they contain many errors and exist only for high-resource languages. We introduce the Wiktionary bilingual lexicon collection, which provides high-quality, human-annotated translations into English for words in 298 languages. We use these lexicons to train and evaluate the largest published collection of aligned word embeddings, covering 157 different languages. All of our code and data are publicly available at https://github.com/mikeizbicki/wiktionary_bli.
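
As a rough illustration of how a bilingual lexicon can be used to evaluate aligned embeddings, the sketch below computes precision@1 for nearest-neighbor translation retrieval, a metric commonly used for bilingual lexicon induction. The abstract does not state the paper's exact evaluation protocol, so this is an assumption; the function names, toy vectors, and toy lexicon are all hypothetical and are not taken from the wiktionary_bli repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_at_1(src_vecs, tgt_vecs, lexicon):
    """Fraction of source words whose nearest target word is a gold translation.

    src_vecs / tgt_vecs: dict mapping word -> vector, assumed to already live
    in a shared (aligned) space.
    lexicon: dict mapping source word -> set of acceptable target translations.
    Source words without an embedding are skipped.
    """
    hits = 0
    total = 0
    for src_word, gold in lexicon.items():
        if src_word not in src_vecs:
            continue
        total += 1
        query = src_vecs[src_word]
        # Retrieve the single closest target word by cosine similarity.
        best = max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))
        if best in gold:
            hits += 1
    return hits / total if total else 0.0


# Toy data, invented for illustration only.
src_vecs = {"gato": [0.9, 0.1], "perro": [0.1, 0.9], "casa": [0.7, 0.7]}
tgt_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "house": [0.6, 0.8]}
lexicon = {"gato": {"cat"}, "perro": {"dog"}, "casa": {"home"}}

score = precision_at_1(src_vecs, tgt_vecs, lexicon)
print(f"precision@1 = {score:.3f}")
```

In this toy lexicon the entry for "casa" lists only "home", so the retrieved "house" counts as a miss. That shows how incomplete or noisy gold entries can lower a measured score even when the retrieved word is a valid translation, which is why lexicon quality matters for evaluation.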