Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

2020-12-01 · VarDial (COLING) 2020

Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

Abstract

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.

Tasks

Language Identification
