SOTAVerified

Language Identification

Language identification is the task of determining the language of a text.
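
As a rough illustration of the task, the sketch below identifies a language by comparing character trigram counts against per-language profiles using cosine similarity. It is a toy example, not the method of any paper listed here: the three sample sentences, the `SAMPLES` table, and the helper names `char_ngrams`, `cosine`, and `identify` are all assumptions made for illustration, and a practical system would be trained on far more data.

```python
from collections import Counter
import math

# Hypothetical, tiny training samples (assumption: illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the text",
    "de": "der schnelle braune fuchs springt auf den faulen hund und das ist die sprache des textes",
    "es": "el zorro salta sobre el perro perezoso y este es el idioma del texto",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built once from the samples above.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sentence in ("the dog is over the fox", "der hund und der fuchs", "el perro y el zorro"):
        print(sentence, "->", identify(sentence))
```

Character n-grams are used here because they need no tokenizer and degrade gracefully on short inputs; the choice of trigrams and cosine similarity is one simple option among many.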

Papers

Showing 1-50 of 794 papers

Title | Status | Hype
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Code | 11
TweetNLP: Cutting-Edge Natural Language Processing for Social Media | Code | 2
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios | Code | 2
MathPile: A Billion-Token-Scale Pretraining Corpus for Math | Code | 2
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | Code | 1
NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer | Code | 1
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters | Code | 1
Scaling Speech Technology to 1,000+ Languages | Code | 1
Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users | Code | 1
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Code | 1
FastSpell: the LangId Magic Spell | Code | 1
AfroLID: A Neural Language Identification Tool for African Languages | Code | 1
SpeechBrain: A General-Purpose Speech Toolkit | Code | 1
PALI: A Language Identification Benchmark for Perso-Arabic Scripts | Code | 1
KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection | Code | 1
Language and Speech Technology for Central Kurdish Varieties | Code | 1
Improving Spoken Language Identification with Map-Mix | Code | 1
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models | Code | 1
MaskLID: Code-Switching Language Identification through Iterative Masking | Code | 1
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification | Code | 1
SOLD: Sinhala Offensive Language Dataset | Code | 1
Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems | Code | 1
GlotLID: Language Identification for Low-Resource Languages | Code | 1
Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data | Code | 1
IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages | Code | 1
VoxLingua107: a Dataset for Spoken Language Recognition | Code | 1
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text | Code | 1
The first neural machine translation system for the Erzya language | Code | 1
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus | Code | 1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages | Code | 1
An Open Dataset and Model for Language Identification | Code | 1
A reproduction of Apple's bi-directional LSTM models for language identification in short strings | Code | 1
Hyperseed: Unsupervised Learning with Vector Symbolic Architectures | Code | 1
BERT-LID: Leveraging BERT to Improve Spoken Language Identification | Code | 1
Common Voice: A Massively-Multilingual Speech Corpus | Code | 1
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages | Code | 1
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media | Code | 1
Language-Informed Beam Search Decoding for Multilingual Machine Translation | Code | 1
GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages | Code | 0
Geographic Adaptation of Pretrained Language Models | Code | 0
From English to Code-Switching: Transfer Learning with Strong Morphological Clues | Code | 0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Code | 0
From N-grams to Pre-trained Multilingual Models For Language Identification | Code | 0
Geographically-Informed Language Identification | Code | 0
Aggressive Language Identification Using Word Embeddings and Sentiment Features | Code | 0
Finding Structure in Text, Genome and Other Symbolic Sequences | Code | 0
An Investigation into the Contribution of Locally Aggregated Descriptors to Figurative Language Identification | Code | 0
AfriHuBERT: A self-supervised speech representation model for African languages | Code | 0
FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | wav2vec 2.0 LV-60K | Error rate | 7.2 | | Unverified
2 | XLS-R | Error rate | 5.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GlotLID | Macro F1 | 0.98 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | FastText | Accuracy | 0.97 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Apple bi-LSTM | Accuracy | 91.37 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Apple bi-LSTM | Accuracy | 86.93 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | ConformerG-P | Accuracy | 99.8 | | Unverified