SOTAVerified

Speech Recognition

Speech recognition is the task of converting spoken language into text: recognizing the words spoken in an audio signal and transcribing them in written form. The goal is to transcribe speech accurately, either in real time or from recorded audio, despite variation in accents, speaking speed, and background noise.

Papers

Showing 851-900 of 6433 papers

| Title | Status | Hype |
| --- | --- | --- |
| Enhancing CTC-based speech recognition with diverse modeling units | | 0 |
| StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Code | 5 |
| Text Injection for Neural Contextual Biasing | | 0 |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | | 0 |
| Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing | | 0 |
| Keyword-Guided Adaptation of Automatic Speech Recognition | | 0 |
| Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping | | 0 |
| Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision | | 0 |
| Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach | | 0 |
| Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization | | 0 |
| YODAS: Youtube-Oriented Dataset for Audio and Speech | | 0 |
| Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning | | 0 |
| Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | | 0 |
| Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR | | 0 |
| NUTS, NARS, and Speech | | 0 |
| Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation | | 0 |
| TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Code | 2 |
| A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition | Code | 1 |
| Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients | Code | 0 |
| Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | | 0 |
| Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding | Code | 0 |
| Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | Code | 3 |
| A Survey on Vision-Language-Action Models for Embodied AI | Code | 4 |
| Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | Code | 2 |
| Contextualized Automatic Speech Recognition with Dynamic Vocabulary | | 0 |
| ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos | | 0 |
| Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation | | 0 |
| You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish | | 0 |
| Non-autoregressive real-time Accent Conversion model with voice cloning | | 0 |
| FairLENS: Assessing Fairness in Law Enforcement Speech Recognition | | 0 |
| Mamba in Speech: Towards an Alternative to Self-Attention | Code | 2 |
| Could a Computer Architect Understand our Brain? | | 0 |
| FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | Code | 2 |
| Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining | | 0 |
| Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System | Code | 4 |
| Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models | | 0 |
| Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings | | 0 |
| No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation | Code | 0 |
| Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | | 0 |
| Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants | | 0 |
| Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining | | 0 |
| SpeechVerse: A Large-scale Generalizable Audio Language Model | | 0 |
| Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases | Code | 0 |
| Large Language Models for Education: A Survey | | 0 |
| SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | Code | 1 |
| Watch Your Mouth: Silent Speech Recognition with Depth Sensing | Code | 1 |
| Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech | | 0 |
| DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation | | 0 |
| Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Code | 1 |
| Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems | Code | 0 |
Page 18 of 129

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AmNet | Word Error Rate (WER) | 8.6 | | Unverified |
| 2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | | Unverified |
| 3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | | Unverified |
| 4 | Snips | Word Error Rate (WER) | 6.4 | | Unverified |
| 5 | Li-GRU | Word Error Rate (WER) | 6.2 | | Unverified |
| 6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | | Unverified |
| 7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | | Unverified |
| 8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | | Unverified |
| 9 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | | Unverified |
| 10 | Gated ConvNets | Word Error Rate (WER) | 4.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | | Unverified |
| 2 | Snips | Word Error Rate (WER) | 16.5 | | Unverified |
| 3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | | Unverified |
| 5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | | Unverified |
| 6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | | Unverified |
| 8 | MT4SSL | Word Error Rate (WER) | 9.6 | | Unverified |
| 9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | | Unverified |
| 10 | Espresso | Word Error Rate (WER) | 8.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Deep Speech | Percentage error | 20 | | Unverified |
| 2 | DNN-HMM | Percentage error | 18.5 | | Unverified |
| 3 | CD-DNN | Percentage error | 16.1 | | Unverified |
| 4 | DNN | Percentage error | 16 | | Unverified |
| 5 | DNN + Dropout | Percentage error | 15 | | Unverified |
| 6 | DNN BMMI | Percentage error | 12.9 | | Unverified |
| 7 | DNN MPE | Percentage error | 12.9 | | Unverified |
| 8 | DNN MMI | Percentage error | 12.9 | | Unverified |
| 9 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | | Unverified |
| 10 | HMM-DNN +sMBR | Percentage error | 12.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LSNN | Percentage error | 33.2 | | Unverified |
| 2 | LAS multitask with indicators sampling | Percentage error | 20.4 | | Unverified |
| 3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | | Unverified |
| 4 | QCNN-10L-256FM | Percentage error | 19.64 | | Unverified |
| 5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | | Unverified |
| 6 | Bi-RNN + Attention | Percentage error | 17.6 | | Unverified |
| 7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | | Unverified |
| 8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | | Unverified |
| 9 | Light Gated Recurrent Units | Percentage error | 16.7 | | Unverified |
| 10 | GRU | Percentage error | 16.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Att | Word Error Rate (WER) | 18.7 | | Unverified |
| 2 | CTC/Att | Word Error Rate (WER) | 6.7 | | Unverified |
| 3 | BRA-E | Word Error Rate (WER) | 6.63 | | Unverified |
| 4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | | Unverified |
| 5 | BAT | Word Error Rate (WER) | 4.97 | | Unverified |
| 6 | Paraformer | Word Error Rate (WER) | 4.95 | | Unverified |
| 7 | U2 | Word Error Rate (WER) | 4.72 | | Unverified |
| 8 | UMA | Word Error Rate (WER) | 4.7 | | Unverified |
| 9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | | Unverified |
| 10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | | Unverified |
| 2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | | Unverified |
| 3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | | Unverified |
| 5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | | Unverified |
| 6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | | Unverified |
| 7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | | Unverified |
| 8 | Espresso | Word Error Rate (WER) | 3.4 | | Unverified |
| 9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | | Unverified |
| 10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | | Unverified |
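
Most of the benchmark results report Word Error Rate (WER). As a rough illustration of what that number means, the sketch below computes WER in the conventional way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a system hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the page does not say how each listed result was scored, and published figures usually depend on benchmark-specific text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / N.

    Hypothetical helper for illustration only. Tokenization is a plain
    whitespace split, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the")
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.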