A reproduction of Apple's bi-directional LSTM models for language identification in short strings
2021-02-11 · EACL 2021 · Code Available
Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent
Code
- github.com/AU-DIS/LSTM_langid (official, in paper, PyTorch, ★ 33)
Abstract
Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
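The abstract reports two kinds of findings: overall accuracy, and mistakes that cluster between related languages. The sketch below shows one way such an evaluation could be tallied. It is not taken from the paper or its repository: the `evaluate` helper is hypothetical, and the labels are toy values invented for illustration.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, Counter of (gold, predicted) pairs for the errors).

    Hypothetical helper for illustration; not part of the paper's code.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty label list")
    # Count only the misclassified examples, keyed by (true, guessed) language.
    confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    correct = len(gold) - sum(confusions.values())
    return correct / len(gold), confusions


# Toy labels, invented for illustration only (not data from the paper).
gold = ["da", "da", "nb", "nb", "sv", "en"]
predicted = ["da", "nb", "nb", "da", "sv", "en"]

accuracy, confusions = evaluate(gold, predicted)
print(f"accuracy = {accuracy:.2%}")
for (g, p), n in confusions.most_common():
    print(f"{g} misread as {p}: {n}")
```

In this toy run every error falls between Danish (`da`) and Norwegian Bokmål (`nb`), which is the kind of related-language pattern the abstract describes.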
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| OpenSubtitles | Apple bi-LSTM | Accuracy | 91.37 | — | Unverified |
| Universal Dependencies | Apple bi-LSTM | Accuracy | 86.93 | — | Unverified |