SOTAVerified

Speech Recognition

Speech Recognition is the task of converting spoken language into text: recognizing the words in an audio signal and transcribing them in written form. The goal is accurate transcription, either in real time or from recorded audio, that is robust to factors such as accents, speaking speed, and background noise.


Papers

Showing 901-950 of 6433 papers

Title | Status | Hype
Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | Code | 0
A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data |  | 0
Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio |  | 0
Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges |  | 0
Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets |  | 0
A Benchmark of French ASR Systems Based on Error Severity |  | 0
GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems |  | 0
Automatic Speech Recognition for Sanskrit with Transfer Learning |  | 0
Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR |  | 0
Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition |  | 0
PIER: A Novel Metric for Evaluating What Matters in Code-Switching | Code | 0
Teaching Wav2Vec2 the Language of the Brain | Code | 0
Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom |  | 0
persoDA: Personalized Data Augmentation for Personalized ASR |  | 0
A Non-autoregressive Model for Joint STT and TTS |  | 0
Selective Attention Merging for low resource tasks: A case study of Child ASR | Code | 0
Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications |  | 0
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model |  | 0
AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR | Code | 0
Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding | Code | 0
Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives |  | 0
A Survey on Spoken Italian Datasets and Corpora |  | 0
Discrete Speech Unit Extraction via Independent Component Analysis | Code | 0
Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing | Code | 0
Benchmarking Rotary Position Embeddings for Automatic Speech Recognition |  | 0
Universal-2-TF: Robust All-Neural Text Formatting for ASR |  | 0
Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI |  | 0
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer |  | 0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Code | 0
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition |  | 0
Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages |  | 0
Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection |  | 0
Deep Learning for Pathological Speech: A Survey |  | 0
Towards a Generalizable Speech Marker for Parkinson's Disease Diagnosis |  | 0
Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models |  | 0
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition | Code | 0
Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer |  | 0
Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models | Code | 0
Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing |  | 0
Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition |  | 0
Incremental Dialogue Management: Survey, Discussion, and Implications for HRI |  | 0
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale |  | 0
Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages |  | 0
Fotheidil: an Automatic Transcription System for the Irish Language |  | 0
Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization |  | 0
Towards a Single ASR Model That Generalizes to Disordered Speech |  | 0
Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization |  | 0
Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition |  | 0
Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning |  | 0
Zero-resource Speech Translation and Recognition with LLMs |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AmNet | Word Error Rate (WER) | 8.6 |  | Unverified
2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 |  | Unverified
3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 |  | Unverified
4 | Snips | Word Error Rate (WER) | 6.4 |  | Unverified
5 | Li-GRU | Word Error Rate (WER) | 6.2 |  | Unverified
6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 |  | Unverified
7 | CTC + policy learning | Word Error Rate (WER) | 5.42 |  | Unverified
8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 |  | Unverified
9 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 |  | Unverified
10 | Gated ConvNets | Word Error Rate (WER) | 4.8 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 |  | Unverified
2 | Snips | Word Error Rate (WER) | 16.5 |  | Unverified
3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 |  | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 |  | Unverified
5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 |  | Unverified
6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 |  | Unverified
7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 |  | Unverified
8 | MT4SSL | Word Error Rate (WER) | 9.6 |  | Unverified
9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 |  | Unverified
10 | Espresso | Word Error Rate (WER) | 8.7 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Deep Speech | Percentage error | 20 |  | Unverified
2 | DNN-HMM | Percentage error | 18.5 |  | Unverified
3 | CD-DNN | Percentage error | 16.1 |  | Unverified
4 | DNN | Percentage error | 16 |  | Unverified
5 | DNN + Dropout | Percentage error | 15 |  | Unverified
6 | DNN BMMI | Percentage error | 12.9 |  | Unverified
7 | DNN MPE | Percentage error | 12.9 |  | Unverified
8 | DNN MMI | Percentage error | 12.9 |  | Unverified
9 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 |  | Unverified
10 | HMM-DNN +sMBR | Percentage error | 12.6 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LSNN | Percentage error | 33.2 |  | Unverified
2 | LAS multitask with indicators sampling | Percentage error | 20.4 |  | Unverified
3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 |  | Unverified
4 | QCNN-10L-256FM | Percentage error | 19.64 |  | Unverified
5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 |  | Unverified
6 | Bi-RNN + Attention | Percentage error | 17.6 |  | Unverified
7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 |  | Unverified
8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 |  | Unverified
9 | Light Gated Recurrent Units | Percentage error | 16.7 |  | Unverified
10 | GRU | Percentage error | 16.6 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Att | Word Error Rate (WER) | 18.7 |  | Unverified
2 | CTC/Att | Word Error Rate (WER) | 6.7 |  | Unverified
3 | BRA-E | Word Error Rate (WER) | 6.63 |  | Unverified
4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 |  | Unverified
5 | BAT | Word Error Rate (WER) | 4.97 |  | Unverified
6 | Paraformer | Word Error Rate (WER) | 4.95 |  | Unverified
7 | U2 | Word Error Rate (WER) | 4.72 |  | Unverified
8 | UMA | Word Error Rate (WER) | 4.7 |  | Unverified
9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 |  | Unverified
10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 |  | Unverified
2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 |  | Unverified
3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 |  | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 |  | Unverified
5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 |  | Unverified
6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 |  | Unverified
7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 |  | Unverified
8 | Espresso | Word Error Rate (WER) | 3.4 |  | Unverified
9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 |  | Unverified
10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 |  | Unverified
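
Most of the results above are reported as Word Error Rate (WER). As a rough illustration of how that metric is usually defined, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example. Each benchmark applies its own text normalization and scoring tools, so this is not the scoring procedure behind any specific number listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Hypothetical helper for illustration; tokenization is a plain whitespace
    split, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

The function returns a fraction, while the tables list percentages, so multiply by 100 to compare. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.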