Language Identification

Language identification is the task of determining the language of a text.
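
As a rough illustration of the task, the sketch below scores a text against character-trigram profiles and returns the best-matching language. It is a toy baseline, not the method of any paper listed here. The function names (`char_ngrams`, `train`, `identify`) and the three training sentences are hypothetical and were invented for this example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples):
    """Build one n-gram count profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text))
    return profiles


def identify(text, profiles):
    """Return the language with the highest add-one-smoothed log-likelihood."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    vocab_size = len(vocab)

    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, invented for illustration only (assumption, not from the page).
SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog and runs through the field"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und läuft durch das feld"),
    ("es", "el rápido zorro marrón salta sobre el perro perezoso y corre por el campo"),
]

profiles = train(SAMPLES)
print(identify("the dog runs over the field", profiles))
print(identify("der hund läuft durch das feld", profiles))
print(identify("el perro corre por el campo", profiles))
```

Real systems train on far larger corpora and cover many more languages. The short, code-mixed, and spoken inputs studied in several of the papers below are where a profile-based baseline like this tends to fail.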

Papers

Showing 1–50 of 794 papers

Title	Date	Tasks	Status	Hype
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	May 23, 2025	Automatic Speech Recognition, Emotion Recognition	Code Available	11
TweetNLP: Cutting-Edge Natural Language Processing for Social Media	Jun 29, 2022	Language Identification, Named Entity Recognition	Code Available	2
MathPile: A Billion-Token-Scale Pretraining Corpus for Math	Dec 28, 2023	Language Identification, Math	Code Available	2
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios	Jun 13, 2024	Language Identification, Self-Supervised Learning	Code Available	2
Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data	Jun 1, 2022	Language Identification, POS	Code Available	1
Language-Informed Beam Search Decoding for Multilingual Machine Translation	Aug 11, 2024	Language Identification, Machine Translation	Code Available	1
Language and Speech Technology for Central Kurdish Varieties	Mar 4, 2024	Automatic Speech Recognition, Diversity	Code Available	1
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus	Oct 27, 2020	Language Identification	Code Available	1
PALI: A Language Identification Benchmark for Perso-Arabic Scripts	Apr 3, 2023	Language Identification	Code Available	1
Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems	Dec 3, 2020	Language Identification, Speech Language Identification	Code Available	1
An Open Dataset and Model for Language Identification	May 23, 2023	Language Identification, model	Code Available	1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper	Sep 15, 2023	Language Identification, speech-recognition	Code Available	1
SpeechBrain: A General-Purpose Speech Toolkit	Jun 8, 2021	Language Identification, Spoken Language Understanding	Code Available	1
MaskLID: Code-Switching Language Identification through Iterative Masking	Jun 10, 2024	Language Identification, Sentence	Code Available	1
GlotLID: Language Identification for Low-Resource Languages	Oct 24, 2023	Dialect Identification, Language Identification	Code Available	1
KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection	Feb 21, 2024	Language Identification, parameter-efficient fine-tuning	Code Available	1
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text	Jun 17, 2021	Language Identification, Sentiment Analysis	Code Available	1
IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages	Aug 24, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models	Apr 18, 2022	Language Identification, Language Modelling	Code Available	1
NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer	Aug 4, 2020	Cross-Lingual Transfer, Data Augmentation	Code Available	1
SOLD: Sinhala Offensive Language Dataset	Dec 1, 2022	Language Identification, Sentence	Code Available	1
The first neural machine translation system for the Erzya language	Sep 19, 2022	Language Identification, Machine Translation	Code Available	1
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters	Jan 12, 2024	Language Identification	Code Available	1
VoxLingua107: a Dataset for Spoken Language Recognition	Nov 25, 2020	Action Detection, Activity Detection	Code Available	1
AfroLID: A Neural Language Identification Tool for African Languages	Oct 21, 2022	Language Identification	Code Available	1
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale	Nov 17, 2021	Language Identification, Representation Learning	Code Available	1
BERT-LID: Leveraging BERT to Improve Spoken Language Identification	Mar 1, 2022	Language Identification, Spoken language identification	Code Available	1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond	Aug 7, 2024	Benchmarking, Language Identification	Code Available	1
Scaling Speech Technology to 1,000+ Languages	May 22, 2023	Automatic Speech Recognition, Language Identification	Code Available	1
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification	Mar 23, 2022	Language Identification	Code Available	1
A reproduction of Apple's bi-directional LSTM models for language identification in short strings	Feb 11, 2021	Language Identification	Code Available	1
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages	May 25, 2023	Language Identification	Code Available	1
Common Voice: A Massively-Multilingual Speech Corpus	Dec 13, 2019	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
FastSpell: the LangId Magic Spell	Apr 12, 2024	Language Identification	Code Available	1
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media	Jul 26, 2020	Abuse Detection, Language Identification	Code Available	1
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages	Oct 31, 2024	Language Identification	Code Available	1
Hyperseed: Unsupervised Learning with Vector Symbolic Architectures	Oct 15, 2021	Few-Shot Learning, Language Identification	Code Available	1
Improving Spoken Language Identification with Map-Mix	Feb 16, 2023	Data Augmentation, Language Identification	Code Available	1
Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users	Apr 27, 2021	Language Identification, Representation Learning	Code Available	1
AlexU-BackTranslation-TL at SemEval-2020 Task 12: Improving Offensive Language Detection Using Data Augmentation and Transfer Learning	Dec 1, 2020	Data Augmentation, Language Identification	Unverified	0
Albanian Language Identification in Text Documents	Jan 14, 2019	Articles, General Classification	Unverified	0
A deep-learning based native-language classification by using a latent semantic analysis for the NLI Shared Task 2017	Sep 1, 2017	Automatic Speech Recognition (ASR), Dimensionality Reduction	Unverified	0
SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification	Apr 29, 2020	Language Identification	Unverified	0
A language model based approach towards large scale and lightweight language identification systems	Oct 13, 2015	Language Identification, Language Modeling	Unverified	0
A Deep Generative Approach to Native Language Identification	Dec 1, 2020	BIG-bench Machine Learning, Language Identification	Unverified	0
A Code-Switching Corpus of Turkish-German Conversations	Apr 1, 2017	Automatic Speech Recognition (ASR), Language Identification	Unverified	0
Addition of Code Mixed Features to Enhance the Sentiment Prediction of Song Lyrics	Jun 11, 2018	Language Identification, Opinion Mining	Unverified	0
Accurate Pinyin-English Codeswitched Language Identification	Nov 1, 2016	Language Identification	Unverified	0
A Federated Learning Approach to Privacy Preserving Offensive Language Identification	Apr 17, 2024	Federated Learning, Language Identification	Unverified	0
A Dataset and Classifier for Recognizing Social Media English	Sep 1, 2017	Language Identification, Language Modeling	Unverified	0

Datasets: VOXLINGUA107, GlotLID-C, Nordic Language Identification, OpenSubtitles, Universal Dependencies, VoxForge

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
2	XLS-R	Error rate	5.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GlotLID	Macro F1	0.98	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	FastText	Accuracy	0.97	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	91.37	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	86.93	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	ConformerG-P	Accuracy	99.8	—	Unverified
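
The tables above report three metrics: accuracy, error rate, and macro F1. The tables do not say whether a value is a fraction or a percentage. The sketch below shows one common way to compute each metric from gold and predicted language labels, returning fractions in [0, 1]. The helper names and the eight toy labels are hypothetical and do not come from any listed benchmark.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_rate(gold, pred):
    """Complement of accuracy: fraction of items labeled incorrectly."""
    return 1.0 - accuracy(gold, pred)


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1, so rare languages count as much as common ones."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only (assumption, not benchmark data).
gold = ["en", "en", "en", "en", "de", "de", "fr", "fr"]
pred = ["en", "en", "en", "en", "de", "en", "fr", "de"]

print(f"accuracy   = {accuracy(gold, pred):.4f}")
print(f"error rate = {error_rate(gold, pred):.4f}")
print(f"macro F1   = {macro_f1(gold, pred):.4f}")
```

On this toy set accuracy is 0.75 while macro F1 is about 0.685, because the two errors fall on the smaller classes. This gap is one reason macro F1 is often preferred when language coverage is imbalanced.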