Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
2020-12-01 · VarDial (COLING) 2020
Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén
Abstract
This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.