SOTAVerified

Speech Recognition

Speech Recognition is the task of converting spoken language into text: recognizing the words spoken in an audio signal and transcribing them into written form. The goal is accurate transcription, either in real time or from recorded audio, that stays robust to factors such as accents, speaking speed, and background noise.
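
Transcription accuracy on this task is commonly summarized as word error rate (WER), the metric reported in several of the benchmark tables on this page. The following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative and is not taken from any paper or toolkit listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Multiplying the result by 100 gives the percentage form used in the tables. Real evaluations usually normalize case and punctuation before scoring, which this sketch leaves out.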

(Image credit: SpecAugment)

Papers

Showing 2701-2750 of 6433 papers

Title | Status | Hype
FAT: Training Neural Networks for Reliable Inference Under Hardware Faults | | 0
Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts | | 0
Adversarial Speech Generation and Natural Speech Recovery for Speech Content Protection | | 0
Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition | | 0
Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT | | 0
Granary: Speech Recognition and Translation Dataset in 25 European Languages | | 0
Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text | | 0
Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge | | 0
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network | | 0
Graph Meets LLM: A Novel Approach to Collaborative Filtering for Robust Conversational Understanding | | 0
Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining | | 0
Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition | | 0
GRASS: the Graz corpus of Read And Spontaneous Speech | | 0
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper | | 0
Grouping Language Model Boundary Words to Speed K--Best Extraction from Hypergraphs | | 0
Grow and Prune Compact, Fast, and Accurate LSTMs | | 0
Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks | | 0
Guided contrastive self-supervised pre-training for automatic speech recognition | | 0
Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces | | 0
Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance | | 0
Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation | | 0
CHISPA on the GO: A mobile Chinese-Spanish translation service for travellers in trouble | | 0
Arabic Dialect Processing Tutorial | | 0
Hallucination of speech recognition errors with sequence to sequence learning | | 0
Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models | | 0
Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis | | 0
Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion | | 0
Handwriting recognition for Scottish Gaelic | | 0
A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system | | 0
Hard Sample Mining for the Improved Retraining of Automatic Speech Recognition | | 0
Accented Speech Recognition Inspired by Human Perception | | 0
Fast, simple and accurate handwritten digit classification by training shallow neural network classifiers with the 'extreme learning machine' algorithm | | 0
Chipmunk: A Systolically Scalable 0.9 mm^2, 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference | | 0
Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation | | 0
HARK Side of Deep Learning -- From Grad Student Descent to Automated Machine Learning | | 0
Fast offline Transformer-based end-to-end automatic speech recognition for real-world applications | | 0
Chinese Medical Speech Recognition with Punctuated Hypothesis | | 0
Arabic Diacritization with Recurrent Neural Networks | | 0
Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects | | 0
HASP: A High-Performance Adaptive Mobile Security Enhancement Against Malicious Speech Recognition | | 0
HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing | | 0
Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition | | 0
Head-synchronous Decoding for Transformer-based Streaming ASR | | 0
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers | | 0
Fast Node Embeddings: Learning Ego-Centric Representations | | 0
Hearings and mishearings: decrypting the spoken word | | 0
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides | | 0
Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems | | 0
Hear No Evil: Towards Adversarial Robustness of Automatic Speech Recognition via Multi-Task Learning | | 0
Fast Labeling and Transcription with the Speechalyzer Toolkit | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AmNet | Word Error Rate (WER) | 8.6 | | Unverified
2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | | Unverified
3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | | Unverified
4 | Snips | Word Error Rate (WER) | 6.4 | | Unverified
5 | Li-GRU | Word Error Rate (WER) | 6.2 | | Unverified
6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | | Unverified
7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | | Unverified
8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | | Unverified
9 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | | Unverified
10 | Gated ConvNets | Word Error Rate (WER) | 4.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | | Unverified
2 | Snips | Word Error Rate (WER) | 16.5 | | Unverified
3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | | Unverified
5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | | Unverified
6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | | Unverified
7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | | Unverified
8 | MT4SSL | Word Error Rate (WER) | 9.6 | | Unverified
9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | | Unverified
10 | Espresso | Word Error Rate (WER) | 8.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Deep Speech | Percentage error | 20 | | Unverified
2 | DNN-HMM | Percentage error | 18.5 | | Unverified
3 | CD-DNN | Percentage error | 16.1 | | Unverified
4 | DNN | Percentage error | 16 | | Unverified
5 | DNN + Dropout | Percentage error | 15 | | Unverified
6 | DNN BMMI | Percentage error | 12.9 | | Unverified
7 | DNN MPE | Percentage error | 12.9 | | Unverified
8 | DNN MMI | Percentage error | 12.9 | | Unverified
9 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | | Unverified
10 | HMM-DNN +sMBR | Percentage error | 12.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LSNN | Percentage error | 33.2 | | Unverified
2 | LAS multitask with indicators sampling | Percentage error | 20.4 | | Unverified
3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | | Unverified
4 | QCNN-10L-256FM | Percentage error | 19.64 | | Unverified
5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | | Unverified
6 | Bi-RNN + Attention | Percentage error | 17.6 | | Unverified
7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | | Unverified
8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | | Unverified
9 | Light Gated Recurrent Units | Percentage error | 16.7 | | Unverified
10 | GRU | Percentage error | 16.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Att | Word Error Rate (WER) | 18.7 | | Unverified
2 | CTC/Att | Word Error Rate (WER) | 6.7 | | Unverified
3 | BRA-E | Word Error Rate (WER) | 6.63 | | Unverified
4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | | Unverified
5 | BAT | Word Error Rate (WER) | 4.97 | | Unverified
6 | Paraformer | Word Error Rate (WER) | 4.95 | | Unverified
7 | U2 | Word Error Rate (WER) | 4.72 | | Unverified
8 | UMA | Word Error Rate (WER) | 4.7 | | Unverified
9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | | Unverified
10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | | Unverified
2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | | Unverified
3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | | Unverified
5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | | Unverified
6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | | Unverified
7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | | Unverified
8 | Espresso | Word Error Rate (WER) | 3.4 | | Unverified
9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | | Unverified
10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | | Unverified