Language Identification

Language identification is the task of determining the language of a text.
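
As a rough illustration of the task (not the method of any paper listed here), the sketch below implements a classic baseline: build a character trigram count profile per language and pick the language whose profile has the highest cosine similarity to the input. The function names (`char_ngrams`, `train`, `cosine`, `identify`) and the three toy training sentences are assumptions made for this example only; a usable system would need far more data and languages.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples, n=3):
    """Build one n-gram count profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text, n))
    return profiles


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the input text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical; far too small for real use).
samples = [
    ("en", "the quick brown fox jumps over the lazy dog and runs through the forest"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und läuft durch den wald"),
    ("fr", "le renard brun rapide saute par-dessus le chien paresseux et court dans la forêt"),
]
profiles = train(samples)

print(identify("the dog runs over the hill", profiles))
print(identify("der hund läuft durch den wald", profiles))
```

Character n-grams are used here because they need no tokenizer and still work on short inputs; the models in the benchmark tables on this page are considerably more capable than this baseline.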

Papers

Showing 1–50 of 794 papers

Title	Date	Tasks	Status	Hype
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks	Jun 10, 2025	Language Identification, Question Answering	Unverified	0
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?	Jun 10, 2025	Language Identification	Unverified	0
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks	Jun 7, 2025	Language Identification	Unverified	0
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge	Jun 2, 2025	Language Identification, speech-recognition	Unverified	0
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC	May 30, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	May 23, 2025	Automatic Speech Recognition, Emotion Recognition	Code Available	11
Token Masking Improves Transformer-Based Text Classification	May 16, 2025	Attribute, Classification	Unverified	0
Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language	May 10, 2025	Language Identification, Synthetic Data Generation	Code Available	0
Improving Informally Romanized Language Identification	Apr 30, 2025	Language Identification	Unverified	0
(Im)possibility of Automated Hallucination Detection in Large Language Models	Apr 23, 2025	Hallucination, Language Identification	Unverified	0
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing	Mar 27, 2025	Language Identification, named-entity-recognition	Unverified	0
KréyoLID From Language Identification Towards Language Mining	Mar 9, 2025	Language Identification, Multi-class Classification	Code Available	0
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts	Feb 25, 2025	Image Segmentation, Language Identification	Unverified	0
English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports	Feb 20, 2025	Domain Adaptation, Language Identification	Code Available	0
Multi-label Scandinavian Language Identification (SLIDE)	Feb 10, 2025	Language Identification, Sentence	Code Available	0
On the use of Performer and Agent Attention for Spoken Language Identification	Feb 9, 2025	Language Identification, Self-Supervised Learning	Unverified	0
Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance	Feb 7, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages	Jan 27, 2025	Diversity, Language Identification	Code Available	0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Jan 10, 2025	Automatic Speech Recognition, Classification	Code Available	0
Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID	Dec 26, 2024	Language Identification, text-to-speech	Unverified	0
Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection	Nov 26, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Exploring Facets of Language Generation in the Limit	Nov 22, 2024	Language Identification, Text Generation	Unverified	0
Can adversarial attacks by large language models be attributed?	Nov 12, 2024	Attribute, Language Identification	Unverified	0
Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages	Nov 6, 2024	Information Retrieval, Language Identification	Unverified	0
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages	Oct 31, 2024	Language Identification	Code Available	1
Computational Approaches to Arabic-English Code-Switching	Oct 17, 2024	Data Augmentation, Language Identification	Unverified	0
Generation through the lens of learning theory	Oct 17, 2024	Language Identification, Learning Theory	Unverified	0
A Multi-Task Text Classification Pipeline with Natural Language Explanations: A User-Centric Evaluation in Sentiment Analysis and Offensive Language Identification in Greek Tweets	Oct 14, 2024	Feature Importance, Language Identification	Unverified	0
From N-grams to Pre-trained Multilingual Models For Language Identification	Oct 11, 2024	Language Identification, XLM-R	Code Available	0
Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset	Oct 5, 2024	Language Identification	Unverified	0
AfriHuBERT: A self-supervised speech representation model for African languages	Sep 30, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	0
Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking	Sep 27, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	0
Leveraging Open-Source Large Language Models for Native Language Identification	Sep 15, 2024	Feature Engineering, Language Acquisition	Unverified	0
Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model	Sep 3, 2024	Language Identification, Mixture-of-Experts	Unverified	0
Literary and Colloquial Dialect Identification for Tamil using Acoustic Features	Aug 27, 2024	Automatic Speech Recognition, Dialect Identification	Unverified	0
Language-Informed Beam Search Decoding for Multilingual Machine Translation	Aug 11, 2024	Language Identification, Machine Translation	Code Available	1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond	Aug 7, 2024	Benchmarking, Language Identification	Code Available	1
Towards Generalized Offensive Language Identification	Jul 26, 2024	Language Identification	Unverified	0
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models	Jun 29, 2024	Language Identification, Machine Translation	Unverified	0
SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR	Jun 26, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Script-Agnostic Language Identification	Jun 25, 2024	Language Identification	Code Available	0
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting	Jun 18, 2024	Decoder, Language Identification	Unverified	0
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios	Jun 13, 2024	Language Identification, Self-Supervised Learning	Code Available	2
Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech	Jun 13, 2024	Language Identification, speaker-diarization	Unverified	0
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation	Jun 12, 2024	Language Identification	Unverified	0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets	Jun 12, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
MaskLID: Code-Switching Language Identification through Iterative Masking	Jun 10, 2024	Language Identification, Sentence	Code Available	1
Malayalam Sign Language Identification using Finetuned YOLOv8 and Computer Vision Techniques	May 8, 2024	Language Identification	Unverified	0
Whispy: Adapting STT Whisper Models to Real-Time Environments	May 6, 2024	Action Detection, Activity Detection	Unverified	0
A Federated Learning Approach to Privacy Preserving Offensive Language Identification	Apr 17, 2024	Federated Learning, Language Identification	Unverified	0

Datasets: VOXLINGUA107, GlotLID-C, Nordic Language Identification, OpenSubtitles, Universal Dependencies, VoxForge

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
2	XLS-R	Error rate	5.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GlotLID	Macro F1	0.98	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	FastText	Accuracy	0.97	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	91.37	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	86.93	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	ConformerG-P	Accuracy	99.8	—	Unverified