Language Identification

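Language identification is the task of determining which natural language a given text is written in. Several of the papers listed below address the spoken variant, which applies the same task to audio.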
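As a rough illustration of the task, the sketch below identifies the language of a short string by comparing its character trigram counts against per-language profiles using cosine similarity. This is only one simple baseline and is not the method of any paper listed here. The training sentences, the language codes, and the names `SAMPLES`, `char_ngrams`, `train`, `cosine`, and `identify` are all made up for this example.

```python
from collections import Counter
import math

# Tiny, made-up training samples (an assumption for illustration only);
# a real system would be trained on far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple english sentence",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist ein einfacher deutscher satz",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et ceci est une phrase simple en français",
}


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, lowercased and space-padded."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    query = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


PROFILES = train(SAMPLES)
print(identify("this is the dog", PROFILES))
print(identify("das ist ein hund", PROFILES))
```

With so little training text the profiles are very sparse, so this sketch is only reliable on inputs that share vocabulary with the samples. The systems in the papers below are trained on much larger corpora.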

Papers

Showing 26–50 of 794 papers

Title	Date	Tasks	Status	Hype
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification	Mar 23, 2022	Language Identification	Code Available	1
BERT-LID: Leveraging BERT to Improve Spoken Language Identification	Mar 1, 2022	Language Identification, Spoken language identification	Code Available	1
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale	Nov 17, 2021	Language Identification, Representation Learning	Code Available	1
Hyperseed: Unsupervised Learning with Vector Symbolic Architectures	Oct 15, 2021	Few-Shot Learning, Language Identification	Code Available	1
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text	Jun 17, 2021	Language Identification, Sentiment Analysis	Code Available	1
SpeechBrain: A General-Purpose Speech Toolkit	Jun 8, 2021	Language Identification, Spoken Language Understanding	Code Available	1
Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users	Apr 27, 2021	Language Identification, Representation Learning	Code Available	1
A reproduction of Apple's bi-directional LSTM models for language identification in short strings	Feb 11, 2021	Language Identification	Code Available	1
Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems	Dec 3, 2020	Language Identification, Speech Language Identification	Code Available	1
VoxLingua107: a Dataset for Spoken Language Recognition	Nov 25, 2020	Action Detection, Activity Detection	Code Available	1
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus	Oct 27, 2020	Language Identification	Code Available	1
NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer	Aug 4, 2020	Cross-Lingual Transfer, Data Augmentation	Code Available	1
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media	Jul 26, 2020	Abuse Detection, Language Identification	Code Available	1
Common Voice: A Massively-Multilingual Speech Corpus	Dec 13, 2019	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks	Jun 10, 2025	Language Identification, Question Answering	Unverified	0
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?	Jun 10, 2025	Language Identification	Unverified	0
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks	Jun 7, 2025	Language Identification	Unverified	0
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge	Jun 2, 2025	Language Identification, speech-recognition	Unverified	0
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC	May 30, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Token Masking Improves Transformer-Based Text Classification	May 16, 2025	Attribute, Classification	Unverified	0
Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language	May 10, 2025	Language Identification, Synthetic Data Generation	Code Available	0
Improving Informally Romanized Language Identification	Apr 30, 2025	Language Identification	Unverified	0
(Im)possibility of Automated Hallucination Detection in Large Language Models	Apr 23, 2025	Hallucination, Language Identification	Unverified	0
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing	Mar 27, 2025	Language Identification, named-entity-recognition	Unverified	0
KréyoLID From Language Identification Towards Language Mining	Mar 9, 2025	Language Identification, Multi-class Classification	Code Available	0


Benchmark Results

Dataset	#	Model	Metric	Claimed	Verified	Status
VOXLINGUA107	1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
VOXLINGUA107	2	XLS-R	Error rate	5.7	—	Unverified
GlotLID-C	1	GlotLID	Macro F1	0.98	—	Unverified
Nordic Language Identification	1	FastText	Accuracy	0.97	—	Unverified
OpenSubtitles	1	Apple bi-LSTM	Accuracy	91.37	—	Unverified
Universal Dependencies	1	Apple bi-LSTM	Accuracy	86.93	—	Unverified
VoxForge	1	ConformerG-P	Accuracy	99.8	—	Unverified
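The benchmark entries above report three metrics: error rate, accuracy, and macro F1. The sketch below shows one common way to compute them from gold and predicted language labels. It assumes error rate means the fraction of misclassified samples (1 minus accuracy), which the listing itself does not state. The function names and the toy labels are made up for this example. Note that the listing appears to mix scales, with some values given as fractions (0.97) and others as percentages (91.37). The code returns fractions.

```python
def accuracy(gold, pred):
    """Fraction of samples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_rate(gold, pred):
    """Fraction of misclassified samples, assumed here to be 1 - accuracy."""
    return 1.0 - accuracy(gold, pred)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration.
gold = ["en", "en", "de", "fr"]
pred = ["en", "de", "de", "fr"]

print(accuracy(gold, pred))    # 0.75
print(error_rate(gold, pred))  # 0.25
print(round(macro_f1(gold, pred), 4))
```

Macro F1 weights every language equally regardless of how many samples it has, which is why it is often preferred over accuracy when the evaluation set covers many low-resource languages with few examples each.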