HeLI-OTS, Off-the-shelf Language Identifier for Text

2022-06-01 · LREC 2022

Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén

Abstract

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method. The HeLI-OTS language identifier is equipped with language models for 200 languages and licensed for academic as well as commercial use. We present the HeLI method and its use in our previous research. Then we compare the performance of the HeLI-OTS language identifier with that of fastText on two different data sets, showing that fastText favors the recall of common languages, whereas HeLI-OTS reaches both high recall and high precision for all languages. While introducing existing off-the-shelf language identification tools, we also give a picture of digital humanities-related research that uses such tools. The validity of such research depends on the output of the language identifier used. Especially for research focusing on less common languages, a tendency to favor widely used languages can be very detrimental, a problem that HeLI-OTS is now able to remedy.
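
The precision and recall contrast described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's evaluation code: the function name `per_language_scores` and the toy labels are hypothetical, chosen only to show how an identifier that leans toward a common language can score perfect recall on that language while rarer languages lose recall.

```python
from collections import Counter


def per_language_scores(gold, predicted):
    """Return {language: (precision, recall)} for two parallel label lists."""
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for lang in set(gold_count) | set(pred_count):
        # Precision: share of predictions for this language that were correct.
        precision = true_pos[lang] / pred_count[lang] if pred_count[lang] else 0.0
        # Recall: share of this language's gold items that were found.
        recall = true_pos[lang] / gold_count[lang] if gold_count[lang] else 0.0
        scores[lang] = (precision, recall)
    return scores


# Hypothetical toy data (ISO 639-3 codes): the predictions lean toward "eng".
gold = ["eng", "eng", "eng", "fin", "fin", "sme"]
predicted = ["eng", "eng", "eng", "fin", "eng", "fin"]

scores = per_language_scores(gold, predicted)
for lang in sorted(scores):
    precision, recall = scores[lang]
    print(f"{lang}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the common language keeps full recall at the cost of precision, while the rarest language is never identified at all, which is the kind of skew the abstract warns about for research on less common languages.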
