Language Identification

Language identification is the task of determining the language of a text.
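
As a rough illustration of the task, the sketch below assigns a language label by comparing character trigram counts against per-language profiles. It is a toy baseline under stated assumptions, not the method of any paper listed here: the function names (`char_ngrams`, `train`, `identify`) and the three training sentences in `TOY_SAMPLES` are hypothetical, and a usable system would need far more data per language.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples):
    """Build one n-gram count profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text))
    return profiles


def identify(text, profiles):
    """Return the language whose normalized profile overlaps most with the text."""
    grams = char_ngrams(text)

    def overlap(profile):
        total = sum(profile.values())
        # Missing n-grams contribute 0 because Counter returns 0 for unknown keys.
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Hypothetical toy training data: one sentence per language.
TOY_SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog and the cat"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und die katze"),
    ("fr", "le renard brun rapide saute par-dessus le chien paresseux et le chat"),
]

profiles = train(TOY_SAMPLES)
print(identify("the dog and the fox", profiles))  # en
```

Character n-grams are used here only because they need no tokenizer and work on short strings; the neural and fastText-style models in the listing below replace the overlap score with a learned classifier.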

Papers


Showing 26–50 of 794 papers

Title	Date	Tasks	Status	Hype	Score
VoxLingua107: a Dataset for Spoken Language Recognition	Nov 25, 2020	Action Detection, Activity Detection	Code Available	1	5
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text	Jun 17, 2021	Language Identification, Sentiment Analysis	Code Available	1	5
The first neural machine translation system for the Erzya language	Sep 19, 2022	Language Identification, Machine Translation	Code Available	1	5
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus	Oct 27, 2020	Language Identification	Code Available	1	5
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper	Sep 15, 2023	Language Identification, speech-recognition	Code Available	1	5
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages	Oct 31, 2024	Language Identification	Code Available	1	5
An Open Dataset and Model for Language Identification	May 23, 2023	Language Identification, model	Code Available	1	5
A reproduction of Apple's bi-directional LSTM models for language identification in short strings	Feb 11, 2021	Language Identification	Code Available	1	5
Hyperseed: Unsupervised Learning with Vector Symbolic Architectures	Oct 15, 2021	Few-Shot Learning, Language Identification	Code Available	1	5
BERT-LID: Leveraging BERT to Improve Spoken Language Identification	Mar 1, 2022	Language Identification, Spoken language identification	Code Available	1	5
Common Voice: A Massively-Multilingual Speech Corpus	Dec 13, 2019	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1	5
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages	May 25, 2023	Language Identification	Code Available	1	5
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media	Jul 26, 2020	Abuse Detection, Language Identification	Code Available	1	5
Language-Informed Beam Search Decoding for Multilingual Machine Translation	Aug 11, 2024	Language Identification, Machine Translation	Code Available	1	5
GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages	Jun 1, 2022	Language Identification, Machine Translation	Code Available	0	5
Geographic Adaptation of Pretrained Language Models	Mar 16, 2022	Language Identification, Language Modeling	Code Available	0	5
From English to Code-Switching: Transfer Learning with Strong Morphological Clues	Sep 11, 2019	Language Identification, Named Entity Recognition (NER)	Code Available	0	5
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Jan 10, 2025	Automatic Speech Recognition, Classification	Code Available	0	5
From N-grams to Pre-trained Multilingual Models For Language Identification	Oct 11, 2024	Language Identification, XLM-R	Code Available	0	5
Geographically-Informed Language Identification	Mar 14, 2024	Language Identification	Code Available	0	5
Aggressive Language Identification Using Word Embeddings and Sentiment Features	Aug 1, 2018	Aggression Identification, BIG-bench Machine Learning	Code Available	0	5
Finding Structure in Text, Genome and Other Symbolic Sequences	Jul 8, 2012	Information Retrieval, Language Identification	Code Available	0	5
An Investigation into the Contribution of Locally Aggregated Descriptors to Figurative Language Identification	Nov 1, 2021	Language Identification, Natural Language Understanding	Code Available	0	5
AfriHuBERT: A self-supervised speech representation model for African languages	Sep 30, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	0	5
FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection	Dec 1, 2020	Language Identification, Machine Translation	Code Available	0	5


Benchmark Results

Dataset	#	Model	Metric	Claimed	Verified	Status
VOXLINGUA107	1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
VOXLINGUA107	2	XLS-R	Error rate	5.7	—	Unverified
GlotLID-C	1	GlotLID	Macro F1	0.98	—	Unverified
Nordic Language Identification	1	FastText	Accuracy	0.97	—	Unverified
OpenSubtitles	1	Apple bi-LSTM	Accuracy	91.37	—	Unverified
Universal Dependencies	1	Apple bi-LSTM	Accuracy	86.93	—	Unverified
VoxForge	1	ConformerG-P	Accuracy	99.8	—	Unverified
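
The benchmark results above report three metrics: error rate, Macro F1, and accuracy. The sketch below shows how these are commonly computed from gold and predicted labels. It assumes the usual definitions (error rate as one minus accuracy, Macro F1 as the unweighted mean of per-language F1) and returns fractions in [0, 1]; the listed papers may report percentages or define the metrics slightly differently. The function names and the example labels are illustrative only.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted language matches the gold language."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_rate(gold, pred):
    """Fraction of misclassified items (assumed here to be 1 - accuracy)."""
    return 1.0 - accuracy(gold, pred)


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only, not taken from any benchmark.
gold = ["en", "en", "de", "de", "fr", "fr"]
pred = ["en", "en", "de", "en", "fr", "de"]

print(round(accuracy(gold, pred), 4))    # 0.6667
print(round(error_rate(gold, pred), 4))  # 0.3333
print(round(macro_f1(gold, pred), 4))    # 0.6556
```

Macro F1 weights every language equally, so it penalizes poor performance on rare languages more than accuracy does, which matters for corpora covering many low-resource languages.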