SOTAVerified

Speech Recognition

Speech Recognition is the task of converting spoken language into text: recognizing the words spoken in an audio recording and transcribing them in written form. The goal is accurate transcription, either in real time or from recorded audio, that stays robust to factors such as accent, speaking speed, and background noise.


Papers

Showing 701-750 of 6433 papers

| Title | Status | Hype |
|---|---|---|
| Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System | Code | 1 |
| Tamil Language Computing: the Present and the Future | | 0 |
| Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition | | 0 |
| Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification | Code | 0 |
| HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing | | 0 |
| Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks | | 0 |
| A voice and speech corpus of patients who underwent upper airway surgery in pre- and post-operative states | Code | 0 |
| Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | Code | 1 |
| Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation | | 0 |
| Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation | | 0 |
| Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments | | 0 |
| CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Code | 11 |
| Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition | | 0 |
| XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models | | 0 |
| Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter | | 0 |
| LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech | | 0 |
| Written Term Detection Improves Spoken Term Detection | Code | 0 |
| Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units | Code | 2 |
| Romanization Encoding For Multilingual ASR | | 0 |
| Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect | | 0 |
| Multitaper mel-spectrograms for keyword spotting | | 0 |
| Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models | Code | 1 |
| Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models | | 0 |
| Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation | | 0 |
| Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Code | 1 |
| Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis | | 0 |
| FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Code | 11 |
| Improving Self-supervised Pre-training using Accent-Specific Codebooks | Code | 1 |
| Serialized Output Training by Learned Dominance | | 0 |
| Multi-Convformer: Extending Conformer with Multiple Convolution Kernels | | 0 |
| Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition | | 0 |
| Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations | | 0 |
| Advanced Framework for Animal Sound Classification With Features Optimization | | 0 |
| Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition | | 0 |
| The USTC-NERCSLIP Systems for The ICMC-ASR Challenge | | 0 |
| Towards the Next Frontier in Speech Representation Learning Using Disentanglement | | 0 |
| Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models | Code | 1 |
| Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations | | 0 |
| Cross-Lingual Transfer Learning for Speech Translation | | 0 |
| Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations | | 0 |
| Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition | | 0 |
| Open-Source Conversational AI with SpeechBrain 1.0 | | 0 |
| Less is More: Accurate Speech Recognition & Translation without Web-Scale Data | | 0 |
| Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Code | 0 |
| Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment | | 0 |
| Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network | Code | 0 |
| Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over | | 0 |
| SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR | | 0 |
| Dynamic Data Pruning for Automatic Speech Recognition | | 0 |
| Automatic Speech Recognition for Hindi | | 0 |
Page 15 of 129

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AmNet | Word Error Rate (WER) | 8.6 | | Unverified |
| 2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | | Unverified |
| 3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | | Unverified |
| 4 | Snips | Word Error Rate (WER) | 6.4 | | Unverified |
| 5 | Li-GRU | Word Error Rate (WER) | 6.2 | | Unverified |
| 6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | | Unverified |
| 7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | | Unverified |
| 8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | | Unverified |
| 9 | Gated ConvNets | Word Error Rate (WER) | 4.8 | | Unverified |
| 10 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | | Unverified |
| 2 | Snips | Word Error Rate (WER) | 16.5 | | Unverified |
| 3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | | Unverified |
| 5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | | Unverified |
| 6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | | Unverified |
| 8 | MT4SSL | Word Error Rate (WER) | 9.6 | | Unverified |
| 9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | | Unverified |
| 10 | Espresso | Word Error Rate (WER) | 8.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Deep Speech | Percentage error | 20 | | Unverified |
| 2 | DNN-HMM | Percentage error | 18.5 | | Unverified |
| 3 | CD-DNN | Percentage error | 16.1 | | Unverified |
| 4 | DNN | Percentage error | 16 | | Unverified |
| 5 | DNN + Dropout | Percentage error | 15 | | Unverified |
| 6 | DNN BMMI | Percentage error | 12.9 | | Unverified |
| 7 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | | Unverified |
| 8 | DNN MPE | Percentage error | 12.9 | | Unverified |
| 9 | DNN MMI | Percentage error | 12.9 | | Unverified |
| 10 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 12.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LSNN | Percentage error | 33.2 | | Unverified |
| 2 | LAS multitask with indicators sampling | Percentage error | 20.4 | | Unverified |
| 3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | | Unverified |
| 4 | QCNN-10L-256FM | Percentage error | 19.64 | | Unverified |
| 5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | | Unverified |
| 6 | Bi-RNN + Attention | Percentage error | 17.6 | | Unverified |
| 7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | | Unverified |
| 8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | | Unverified |
| 9 | Light Gated Recurrent Units | Percentage error | 16.7 | | Unverified |
| 10 | GRU | Percentage error | 16.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Att | Word Error Rate (WER) | 18.7 | | Unverified |
| 2 | CTC/Att | Word Error Rate (WER) | 6.7 | | Unverified |
| 3 | BRA-E | Word Error Rate (WER) | 6.63 | | Unverified |
| 4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | | Unverified |
| 5 | BAT | Word Error Rate (WER) | 4.97 | | Unverified |
| 6 | Paraformer | Word Error Rate (WER) | 4.95 | | Unverified |
| 7 | U2 | Word Error Rate (WER) | 4.72 | | Unverified |
| 8 | UMA | Word Error Rate (WER) | 4.7 | | Unverified |
| 9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | | Unverified |
| 10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | | Unverified |
| 2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | | Unverified |
| 3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | | Unverified |
| 5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | | Unverified |
| 6 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | | Unverified |
| 8 | Espresso | Word Error Rate (WER) | 3.4 | | Unverified |
| 9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | | Unverified |
| 10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | | Unverified |
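
Most of the results above are reported as Word Error Rate (WER). As a rough illustration of what that metric measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the choice to return a percentage are assumptions made for this example; the listed papers may normalize text, tokenize, or score differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (assumed scale for this sketch).

    WER = (substitutions + deletions + insertions) / reference word count.
    Tokenization is plain whitespace splitting, which is an assumption here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is dropped in the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100 when the hypothesis is much longer than the reference.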