Language Identification

Language identification is the task of determining the language of a text.
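
As a rough illustration of the task defined above, the following toy sketch guesses a text's language by comparing character-trigram counts against small per-language profiles. It is not taken from any paper or benchmark listed on this page; the `SAMPLES` sentences, the three language codes, and the function names are all illustrative assumptions, and real systems are trained on far more data.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical, tiny "training" samples; one sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs through the forest",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann durch den wald",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et court ensuite dans la forêt",
}

# One trigram profile per language.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

Calling `identify("der hund läuft")` compares the query's trigrams with each profile and returns the code of the closest one. With profiles this small, short or out-of-vocabulary inputs can easily be misclassified.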

Papers


Showing 1–25 of 794 papers

Title	Date	Tasks	Status	Hype
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?	Jun 10, 2025	Language Identification	Unverified	0
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks	Jun 10, 2025	Language Identification, Question Answering	Unverified	0
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks	Jun 7, 2025	Language Identification	Unverified	0
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge	Jun 2, 2025	Language Identification, speech-recognition	Unverified	0
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC	May 30, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	May 23, 2025	Automatic Speech Recognition, Emotion Recognition	Code Available	11
Token Masking Improves Transformer-Based Text Classification	May 16, 2025	Attribute, Classification	Unverified	0
Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language	May 10, 2025	Language Identification, Synthetic Data Generation	Code Available	0
Improving Informally Romanized Language Identification	Apr 30, 2025	Language Identification	Unverified	0
(Im)possibility of Automated Hallucination Detection in Large Language Models	Apr 23, 2025	Hallucination, Language Identification	Unverified	0
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing	Mar 27, 2025	Language Identification, named-entity-recognition	Unverified	0
KréyoLID From Language Identification Towards Language Mining	Mar 9, 2025	Language Identification, Multi-class Classification	Code Available	0
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts	Feb 25, 2025	Image Segmentation, Language Identification	Unverified	0
English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports	Feb 20, 2025	Domain Adaptation, Language Identification	Code Available	0
Multi-label Scandinavian Language Identification (SLIDE)	Feb 10, 2025	Language Identification, Sentence	Code Available	0
On the use of Performer and Agent Attention for Spoken Language Identification	Feb 9, 2025	Language Identification, Self-Supervised Learning	Unverified	0
Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance	Feb 7, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages	Jan 27, 2025	Diversity, Language Identification	Code Available	0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Jan 10, 2025	Automatic Speech Recognition, Classification	Code Available	0
Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID	Dec 26, 2024	Language Identification, text-to-speech	Unverified	0
Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection	Nov 26, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Exploring Facets of Language Generation in the Limit	Nov 22, 2024	Language Identification, Text Generation	Unverified	0
Can adversarial attacks by large language models be attributed?	Nov 12, 2024	Attribute, Language Identification	Unverified	0
Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages	Nov 6, 2024	Information Retrieval, Language Identification	Unverified	0
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages	Oct 31, 2024	Language Identification	Code Available	1


Benchmark Results

Dataset	#	Model	Metric	Claimed	Verified	Status
VOXLINGUA107	1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
VOXLINGUA107	2	XLS-R	Error rate	5.7	—	Unverified
GlotLID-C	1	GlotLID	Macro F1	0.98	—	Unverified
Nordic Language Identification	1	FastText	Accuracy	0.97	—	Unverified
OpenSubtitles	1	Apple bi-LSTM	Accuracy	91.37	—	Unverified
Universal Dependencies	1	Apple bi-LSTM	Accuracy	86.93	—	Unverified
VoxForge	1	ConformerG-P	Accuracy	99.8	—	Unverified