A reproduction of Apple's bi-directional LSTM models for language identification in short strings

2021-02-11 · EACL 2021 · Code Available

Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent

Code

github.com/AU-DIS/LSTM_langid
Official · In paper · PyTorch · ★ 33

Abstract

Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

Tasks

Language Identification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
OpenSubtitles	Apple bi-LSTM	Accuracy	91.37	—	Unverified
Universal Dependencies	Apple bi-LSTM	Accuracy	86.93	—	Unverified
