| CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | May 23, 2025 | Automatic Speech Recognition, Emotion Recognition | Code Available | 11 | 5 |
| TweetNLP: Cutting-Edge Natural Language Processing for Social Media | Jun 29, 2022 | Language Identification, Named Entity Recognition | Code Available | 2 | 5 |
| An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios | Jun 13, 2024 | Language Identification, Self-Supervised Learning | Code Available | 2 | 5 |
| MathPile: A Billion-Token-Scale Pretraining Corpus for Math | Dec 28, 2023 | Language Identification, Math | Code Available | 2 | 5 |
| XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | Nov 17, 2021 | Language Identification, Representation Learning | Code Available | 1 | 5 |
| NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer | Aug 4, 2020 | Cross-Lingual Transfer, Data Augmentation | Code Available | 1 | 5 |
| AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters | Jan 12, 2024 | Language Identification | Code Available | 1 | 5 |
| Scaling Speech Technology to 1,000+ Languages | May 22, 2023 | Automatic Speech Recognition, Language Identification | Code Available | 1 | 5 |
| Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users | Apr 27, 2021 | Language Identification, Representation Learning | Code Available | 1 | 5 |
| Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Aug 7, 2024 | Benchmarking, Language Identification | Code Available | 1 | 5 |
| FastSpell: the LangId Magic Spell | Apr 12, 2024 | Language Identification | Code Available | 1 | 5 |
| AfroLID: A Neural Language Identification Tool for African Languages | Oct 21, 2022 | Language Identification | Code Available | 1 | 5 |
| SpeechBrain: A General-Purpose Speech Toolkit | Jun 8, 2021 | Language Identification, Spoken Language Understanding | Code Available | 1 | 5 |
| PALI: A Language Identification Benchmark for Perso-Arabic Scripts | Apr 3, 2023 | Language Identification | Code Available | 1 | 5 |
| KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection | Feb 21, 2024 | Language Identification, parameter-efficient fine-tuning | Code Available | 1 | 5 |
| Language and Speech Technology for Central Kurdish Varieties | Mar 4, 2024 | Automatic Speech Recognition, Diversity | Code Available | 1 | 5 |
| Improving Spoken Language Identification with Map-Mix | Feb 16, 2023 | Data Augmentation, Language Identification | Code Available | 1 | 5 |
| L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models | Apr 18, 2022 | Language Identification, Language Modelling | Code Available | 1 | 5 |
| MaskLID: Code-Switching Language Identification through Iterative Masking | Jun 10, 2024 | Language Identification, Sentence | Code Available | 1 | 5 |
| PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification | Mar 23, 2022 | Language Identification | Code Available | 1 | 5 |
| SOLD: Sinhala Offensive Language Dataset | Dec 1, 2022 | Language Identification, Sentence | Code Available | 1 | 5 |
| Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems | Dec 3, 2020 | Language Identification, Speech Language Identification | Code Available | 1 | 5 |
| GlotLID: Language Identification for Low-Resource Languages | Oct 24, 2023 | Dialect Identification, Language Identification | Code Available | 1 | 5 |
| Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data | Jun 1, 2022 | Language Identification, POS | Code Available | 1 | 5 |
| IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages | Aug 24, 2022 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| VoxLingua107: a Dataset for Spoken Language Recognition | Nov 25, 2020 | Action Detection, Activity Detection | Code Available | 1 | 5 |
| DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text | Jun 17, 2021 | Language Identification, Sentiment Analysis | Code Available | 1 | 5 |
| The first neural machine translation system for the Erzya language | Sep 19, 2022 | Language Identification, Machine Translation | Code Available | 1 | 5 |
| Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus | Oct 27, 2020 | Language Identification | Code Available | 1 | 5 |
| Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Sep 15, 2023 | Language Identification, speech-recognition | Code Available | 1 | 5 |
| GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages | Oct 31, 2024 | Language Identification | Code Available | 1 | 5 |
| An Open Dataset and Model for Language Identification | May 23, 2023 | Language Identification, model | Code Available | 1 | 5 |
| A reproduction of Apple's bi-directional LSTM models for language identification in short strings | Feb 11, 2021 | Language Identification | Code Available | 1 | 5 |
| Hyperseed: Unsupervised Learning with Vector Symbolic Architectures | Oct 15, 2021 | Few-Shot Learning, Language Identification | Code Available | 1 | 5 |
| BERT-LID: Leveraging BERT to Improve Spoken Language Identification | Mar 1, 2022 | Language Identification, Spoken language identification | Code Available | 1 | 5 |
| Common Voice: A Massively-Multilingual Speech Corpus | Dec 13, 2019 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages | May 25, 2023 | Language Identification | Code Available | 1 | 5 |
| KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media | Jul 26, 2020 | Abuse Detection, Language Identification | Code Available | 1 | 5 |
| Language-Informed Beam Search Decoding for Multilingual Machine Translation | Aug 11, 2024 | Language Identification, Machine Translation | Code Available | 1 | 5 |
| GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages | Jun 1, 2022 | Language Identification, Machine Translation | Code Available | 0 | 5 |
| Geographic Adaptation of Pretrained Language Models | Mar 16, 2022 | Language Identification, Language Modeling | Code Available | 0 | 5 |
| From English to Code-Switching: Transfer Learning with Strong Morphological Clues | Sep 11, 2019 | Language Identification, Named Entity Recognition (NER) | Code Available | 0 | 5 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech Recognition, Classification | Code Available | 0 | 5 |
| From N-grams to Pre-trained Multilingual Models For Language Identification | Oct 11, 2024 | Language Identification, XLM-R | Code Available | 0 | 5 |
| Geographically-Informed Language Identification | Mar 14, 2024 | Language Identification | Code Available | 0 | 5 |
| Aggressive Language Identification Using Word Embeddings and Sentiment Features | Aug 1, 2018 | Aggression Identification, BIG-bench Machine Learning | Code Available | 0 | 5 |
| Finding Structure in Text, Genome and Other Symbolic Sequences | Jul 8, 2012 | Information Retrieval, Language Identification | Code Available | 0 | 5 |
| An Investigation into the Contribution of Locally Aggregated Descriptors to Figurative Language Identification | Nov 1, 2021 | Language Identification, Natural Language Understanding | Code Available | 0 | 5 |
| AfriHuBERT: A self-supervised speech representation model for African languages | Sep 30, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 0 | 5 |
| FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection | Dec 1, 2020 | Language Identification, Machine Translation | Code Available | 0 | 5 |