Language Identification

Language identification is the task of determining the language of a text.
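
As a rough illustration of the task, the sketch below guesses the language of a short string by comparing its character trigrams against per-language profiles. The training sentences, the `identify` function, and the choice of cosine similarity over trigram counts are assumptions made for this example; they are not taken from any of the papers listed below, and a practical system would train on far more data.

```python
from collections import Counter
import math

# Toy training text, invented for illustration only (hypothetical data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the text",
    "de": "der schnelle braune fuchs springt und das ist die sprache des textes mit dem faulen hund",
    "es": "el zorro salta sobre el perro perezoso y este es el idioma del texto",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built from the toy samples.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


print(identify("the dog is lazy"))
print(identify("der hund ist faul"))
print(identify("el perro salta"))
```

Character n-grams are a common baseline because they need no tokenizer and still work on very short inputs, although closely related languages and code-switched text usually require stronger models.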

Papers

Showing 101–150 of 794 papers

Title	Date	Tasks	Status	Hype
Investigating model performance in language identification: beyond simple error statistics	May 30, 2023	Language Identification	Code Available	0
MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization	May 30, 2023	Language Identification	Code Available	0
Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities	May 25, 2023	Language Identification, Machine Translation	Code Available	0
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages	May 25, 2023	Language Identification	Code Available	1
An Open Dataset and Model for Language Identification	May 23, 2023	Language Identification, model	Code Available	1
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages	May 23, 2023	Language Identification, Translation	Code Available	0
Multilingual Large Language Models Are Not (Yet) Code-Switchers	May 23, 2023	Benchmarking, Language Identification	Unverified	0
Scaling Speech Technology to 1,000+ Languages	May 22, 2023	Automatic Speech Recognition, Language Identification	Code Available	1
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark	May 18, 2023	Automatic Speech Recognition, Language Identification	Unverified	0
DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents	May 3, 2023	Few-Shot Learning, Language Identification	Code Available	0
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding	May 2, 2023	Automatic Speech Recognition, Language Identification	Unverified	0
Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki	Apr 3, 2023	Language Identification	Code Available	0
PALI: A Language Identification Benchmark for Perso-Arabic Scripts	Apr 3, 2023	Language Identification	Code Available	1
MMT: A Multilingual and Multi-Topic Indian Social Media Dataset	Apr 2, 2023	Diversity, Language Identification	Unverified	0
Joint unsupervised and supervised learning for context-aware language identification	Mar 29, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Language Variety Identification with True Labels	Mar 2, 2023	Language Identification	Code Available	0
Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training	Mar 1, 2023	Language Identification	Unverified	0
Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition	Feb 28, 2023	Language Identification, Scene Text Recognition	Unverified	0
Language identification as improvement for lip-based biometric visual systems	Feb 27, 2023	Language Identification	Unverified	0
Improving Spoken Language Identification with Map-Mix	Feb 16, 2023	Data Augmentation, Language Identification	Code Available	1
Cross-Corpora Spoken Language Identification with Domain Diversification and Generalization	Feb 10, 2023	Data Augmentation, Domain Generalization	Unverified	0
A Twitter BERT Approach for Offensive Language Detection in Marathi	Dec 20, 2022	Data Augmentation, Language Identification	Unverified	0
SOLD: Sinhala Offensive Language Dataset	Dec 1, 2022	Language Identification, Sentence	Code Available	1
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective	Nov 30, 2022	Language Identification, Spoken language identification	Unverified	0
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts	Nov 26, 2022	Language Identification	Unverified	0
Predicting the Type and Target of Offensive Social Media Posts in Marathi	Nov 22, 2022	Language Identification	Code Available	0
Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi	Nov 18, 2022	Language Identification	Unverified	0
Scaling Native Language Identification with Transformer Adapters	Nov 18, 2022	Language Identification, Marketing	Unverified	0
CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts	Nov 17, 2022	Language Identification, Sentence	Unverified	0
Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models	Nov 9, 2022	Language Identification, Spoken language identification	Unverified	0
LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers	Nov 5, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
A Compact End-to-End Model with Local and Global Context for Spoken Language Identification	Oct 27, 2022	Language Identification, Spoken language identification	Unverified	0
AfroLID: A Neural Language Identification Tool for African Languages	Oct 21, 2022	Language Identification	Code Available	1
Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes	Oct 1, 2022	Dialect Identification, Language Identification	Code Available	0
Neural Networks for Cross-domain Language Identification. Phlyers @Vardial 2022	Oct 1, 2022	Language Identification	Unverified	0
The Curious Case of Logistic Regression for Italian Languages and Dialects Identification	Oct 1, 2022	Language Identification, Machine Translation	Code Available	0
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan	Oct 1, 2022	8k, Language Identification	Unverified	0
The first neural machine translation system for the Erzya language	Sep 19, 2022	Language Identification, Machine Translation	Code Available	1
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification	Sep 13, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Evaluation of Off-the-Shelf Language Identification Tools on Bulgarian Social Media Posts	Sep 1, 2022	Language Identification	Unverified	0
IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages	Aug 24, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
Unravelling Interlanguage Facts via Explainable Machine Learning	Aug 2, 2022	BIG-bench Machine Learning, Language Identification	Unverified	0
Extending RNN-T-based speech recognition systems with emotion and language classification	Jul 28, 2022	Emotion Classification, Emotion Recognition	Unverified	0
Distilled Non-Semantic Speech Embeddings with Binary Neural Networks for Low-Resource Devices	Jul 12, 2022	Emotion Recognition, Keyword Spotting	Code Available	0
Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition	Jul 12, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
TechSSN at SemEval-2022 Task 6: Intended Sarcasm Detection using Transformer Models	Jul 1, 2022	Language Identification, Sarcasm Detection	Unverified	0
TweetNLP: Cutting-Edge Natural Language Processing for Social Media	Jun 29, 2022	Language Identification, Named Entity Recognition	Code Available	2
Language Identification for Austronesian Languages	Jun 9, 2022	Language Identification	Code Available	0
Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data	Jun 1, 2022	Language Identification, POS	Code Available	1
CoSwID, a Code Switching Identification Method Suitable for Under-Resourced Languages	Jun 1, 2022	Language Identification	Unverified	0

Datasets: VOXLINGUA107, GlotLID-C, Nordic Language Identification, OpenSubtitles, Universal Dependencies, VoxForge

Benchmark Results

VOXLINGUA107

#	Model	Metric	Claimed	Verified	Status
1	wav2vec 2.0 LV-60K	Error rate	7.2	—	Unverified
2	XLS-R	Error rate	5.7	—	Unverified

GlotLID-C

#	Model	Metric	Claimed	Verified	Status
1	GlotLID	Macro F1	0.98	—	Unverified

Nordic Language Identification

#	Model	Metric	Claimed	Verified	Status
1	FastText	Accuracy	0.97	—	Unverified

OpenSubtitles

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	91.37	—	Unverified

Universal Dependencies

#	Model	Metric	Claimed	Verified	Status
1	Apple bi-LSTM	Accuracy	86.93	—	Unverified

VoxForge

#	Model	Metric	Claimed	Verified	Status
1	ConformerG-P	Accuracy	99.8	—	Unverified
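
The benchmark tables above report error rate, accuracy, and macro F1, with some values given as fractions and others as percentages. The sketch below shows how these metrics are commonly computed from gold and predicted language labels. The function names and the example labels are hypothetical, and the listed papers may define their metrics differently (for example, over utterances rather than sentences).

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_rate(gold, pred):
    """Complement of accuracy."""
    return 1.0 - accuracy(gold, pred)


def macro_f1(gold, pred):
    """Unweighted mean of per-language F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
gold = ["en", "en", "de", "es"]
pred = ["en", "de", "de", "es"]

print(accuracy(gold, pred))    # 0.75
print(error_rate(gold, pred))  # 0.25
print(macro_f1(gold, pred))
```

Macro F1 weights every language equally, so it penalizes poor performance on rare languages more than accuracy does.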