Multi-label Scandinavian Language Identification (SLIDE)

2025-02-10

Mariia Fedorova, Jonas Sebulon Frydenberg, Victoria Handford, Victoria Ovedie Chruickshank Langø, Solveig Helene Willoch, Marthe Løken Midtgaard, Yves Scherrer, Petter Mæhlum, David Samuel

Code

github.com/ltgoslo/slide
Official · In paper · PyTorch

Abstract

Identifying closely related languages at the sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation (SLIDE), a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
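
To illustrate what multi-label prediction means here, the following is a minimal sketch and not the authors' method or the SLIDE API. It assumes a hypothetical model that emits one logit per language, scores each language independently with a sigmoid, and returns every language above a threshold. The language codes, the threshold of 0.5, the example logits, and all function names are assumptions made for illustration.

```python
import math

# Assumed label order: Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish.
LANGUAGES = ["da", "nb", "nn", "sv"]


def sigmoid(x: float) -> float:
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every language whose independent probability meets the threshold.

    Unlike a softmax followed by argmax, this can return several languages
    for a sentence that is valid in more than one of them.
    """
    probs = [sigmoid(z) for z in logits]
    labels = [lang for lang, p in zip(LANGUAGES, probs) if p >= threshold]
    if not labels:
        # Assumption: every sentence gets at least one label, so fall back
        # to the single most probable language.
        best = max(range(len(probs)), key=probs.__getitem__)
        labels = [LANGUAGES[best]]
    return labels


def exact_match(predicted, gold):
    """True only when the predicted label set equals the gold label set."""
    return set(predicted) == set(gold)


# Hypothetical logits for a sentence that reads as both Danish and Bokmål.
example_logits = [2.1, 1.7, -0.4, -3.0]
print(predict_labels(example_logits))  # ['da', 'nb']
```

A single-label classifier would be forced to pick only one of the two languages for this sentence, so it could never achieve an exact match against a gold label set containing both.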

Tasks

Language Identification, Sentence
