SOTAVerified

Language Identification

Language identification is the task of determining the language of a text.
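As a rough illustration of the task, the sketch below identifies the language of a short string by comparing its character trigram counts against per-language profiles. It is a toy baseline, not the method of any paper listed here: the training sentences in `TOY_SAMPLES`, the helper names (`char_ngrams`, `cosine`, `identify`), and the choice of cosine similarity are all assumptions made for the example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return a Counter of character n-grams from lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text, invented for illustration only (hypothetical data).
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis le chien dort",
}

# One trigram profile per language, built once from the toy samples.
PROFILES = {lang: char_ngrams(text) for lang, text in TOY_SAMPLES.items()}


def identify(text):
    """Return the language code whose trigram profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

On this toy data, `identify("der hund")` returns `"de"`. Real systems are trained on far more text and cover many more languages; the sketch only shows the shape of the problem: text in, language label out.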

Papers

Showing 1-50 of 794 papers

Title | Status | Hype
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Code | 11
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios | Code | 2
MathPile: A Billion-Token-Scale Pretraining Corpus for Math | Code | 2
TweetNLP: Cutting-Edge Natural Language Processing for Social Media | Code | 2
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages | Code | 1
Language-Informed Beam Search Decoding for Multilingual Machine Translation | Code | 1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Code | 1
MaskLID: Code-Switching Language Identification through Iterative Masking | Code | 1
FastSpell: the LangId Magic Spell | Code | 1
Language and Speech Technology for Central Kurdish Varieties | Code | 1
KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection | Code | 1
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters | Code | 1
GlotLID: Language Identification for Low-Resource Languages | Code | 1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages | Code | 1
An Open Dataset and Model for Language Identification | Code | 1
Scaling Speech Technology to 1,000+ Languages | Code | 1
PALI: A Language Identification Benchmark for Perso-Arabic Scripts | Code | 1
Improving Spoken Language Identification with Map-Mix | Code | 1
SOLD: Sinhala Offensive Language Dataset | Code | 1
AfroLID: A Neural Language Identification Tool for African Languages | Code | 1
The first neural machine translation system for the Erzya language | Code | 1
IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages | Code | 1
Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data | Code | 1
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models | Code | 1
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification | Code | 1
BERT-LID: Leveraging BERT to Improve Spoken Language Identification | Code | 1
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | Code | 1
Hyperseed: Unsupervised Learning with Vector Symbolic Architectures | Code | 1
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text | Code | 1
SpeechBrain: A General-Purpose Speech Toolkit | Code | 1
Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users | Code | 1
A reproduction of Apple's bi-directional LSTM models for language identification in short strings | Code | 1
Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems | Code | 1
VoxLingua107: a Dataset for Spoken Language Recognition | Code | 1
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus | Code | 1
NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer | Code | 1
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media | Code | 1
Common Voice: A Massively-Multilingual Speech Corpus | Code | 1
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks | - | 0
Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world? | - | 0
Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks | - | 0
TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge | - | 0
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC | - | 0
Token Masking Improves Transformer-Based Text Classification | - | 0
Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language | Code | 0
Improving Informally Romanized Language Identification | - | 0
(Im)possibility of Automated Hallucination Detection in Large Language Models | - | 0
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing | - | 0
KréyoLID From Language Identification Towards Language Mining | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | wav2vec 2.0 LV-60K | Error rate | 7.2 | - | Unverified
2 | XLS-R | Error rate | 5.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GlotLID | Macro F1 | 0.98 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | FastText | Accuracy | 0.97 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Apple bi-LSTM | Accuracy | 91.37 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Apple bi-LSTM | Accuracy | 86.93 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ConformerG-P | Accuracy | 99.8 | - | Unverified
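
The tables above report three metrics: error rate, accuracy, and macro F1. This page does not say how each benchmark computes them, so the sketch below uses the standard textbook definitions as an assumption. The function names and the example labels are hypothetical. Note also that the claimed values mix scales (fractions such as 0.97 and percentages such as 91.37), so they are not directly comparable across tables.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_rate(gold, pred):
    """Complement of accuracy: fraction of items labeled incorrectly."""
    return 1.0 - accuracy(gold, pred)


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1 scores.

    Assumption: the label set is the union of gold and predicted labels;
    some evaluations average over gold labels only.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted language p, but it was wrong
            fn[g] += 1  # gold language g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four test items.
gold = ["en", "en", "de", "fr"]
pred = ["en", "de", "de", "fr"]
print(accuracy(gold, pred), error_rate(gold, pred), macro_f1(gold, pred))
```

Macro F1 weights every language equally regardless of how many test items it has, which is why it is often preferred for evaluations that include low-resource languages.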