SOTAVerified

Speech Recognition

Speech Recognition is the task of converting spoken language into text: recognizing the words spoken in an audio recording and transcribing them into written form. The goal is accurate transcription, either in real time or from recorded audio, despite variation in accent, speaking speed, and background noise.

Papers

Showing 751-800 of 6433 papers

| Title | Status | Hype |
| --- | --- | --- |
| Dynamic Data Pruning for Automatic Speech Recognition | | 0 |
| ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs | Code | 1 |
| FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data | Code | 0 |
| Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet | Code | 1 |
| Sequential Editing for Lifelong Training of Speech Recognition Models | | 0 |
| MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization | | 0 |
| Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model | Code | 1 |
| A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR | | 0 |
| Investigating Confidence Estimation Measures for Speaker Diarization | | 0 |
| Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 | | 0 |
| Decoder-only Architecture for Streaming End-to-end Speech Recognition | | 0 |
| Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss | | 0 |
| Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment | | 0 |
| PI-Whisper: Designing an Adaptive and Incremental Automatic Speech Recognition System for Edge Devices | | 0 |
| InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | | 0 |
| Perception of Phonological Assimilation by Neural Speech Recognition Models | | 0 |
| An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks | | 0 |
| Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries | | 0 |
| DASB -- Discrete Audio and Speech Benchmark | | 0 |
| Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control | | 0 |
| Children's Speech Recognition through Discrete Token Enhancement | | 0 |
| ManWav: The First Manchu ASR Model | | 0 |
| Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting | | 0 |
| Transcribe, Align and Segment: Creating speech datasets for low-resource languages | | 0 |
| Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model | | 0 |
| Performant ASR Models for Medical Entities in Accented Speech | | 0 |
| Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Code | 0 |
| Unsupervised Online Continual Learning for Automatic Speech Recognition | Code | 0 |
| SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Code | 2 |
| Self-Train Before You Transcribe | Code | 0 |
| GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Code | 3 |
| Automatic Speech Recognition for Biomedical Data in Bengali Language | | 0 |
| Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Code | 1 |
| Large Language Models for Dysfluency Detection in Stuttered Speech | | 0 |
| Optimized Speculative Sampling for GPU Hardware Accelerators | Code | 0 |
| Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition | | 0 |
| CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving | | 0 |
| Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach | | 0 |
| Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare | | 0 |
| ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR | | 0 |
| CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge | | 0 |
| Optimizing Byte-level Representation for End-to-end ASR | | 0 |
| On the Evaluation of Speech Foundation Models for Spoken Language Understanding | | 0 |
| An efficient text augmentation approach for contextualized Mandarin speech recognition | | 0 |
| Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection | Code | 2 |
| Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Code | 3 |
| Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition | | 0 |
| Learning Language Structures through Grounding | | 0 |
| Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation | | 0 |
| Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AmNet | Word Error Rate (WER) | 8.6 | | Unverified |
| 2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | | Unverified |
| 3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | | Unverified |
| 4 | Snips | Word Error Rate (WER) | 6.4 | | Unverified |
| 5 | Li-GRU | Word Error Rate (WER) | 6.2 | | Unverified |
| 6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | | Unverified |
| 7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | | Unverified |
| 8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | | Unverified |
| 9 | Gated ConvNets | Word Error Rate (WER) | 4.8 | | Unverified |
| 10 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | | Unverified |
| 2 | Snips | Word Error Rate (WER) | 16.5 | | Unverified |
| 3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | | Unverified |
| 5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | | Unverified |
| 6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | | Unverified |
| 8 | MT4SSL | Word Error Rate (WER) | 9.6 | | Unverified |
| 9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | | Unverified |
| 10 | Espresso | Word Error Rate (WER) | 8.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Deep Speech | Percentage error | 20 | | Unverified |
| 2 | DNN-HMM | Percentage error | 18.5 | | Unverified |
| 3 | CD-DNN | Percentage error | 16.1 | | Unverified |
| 4 | DNN | Percentage error | 16 | | Unverified |
| 5 | DNN + Dropout | Percentage error | 15 | | Unverified |
| 6 | DNN BMMI | Percentage error | 12.9 | | Unverified |
| 7 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | | Unverified |
| 8 | DNN MPE | Percentage error | 12.9 | | Unverified |
| 9 | DNN MMI | Percentage error | 12.9 | | Unverified |
| 10 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB | Percentage error | 12.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LSNN | Percentage error | 33.2 | | Unverified |
| 2 | LAS multitask with indicators sampling | Percentage error | 20.4 | | Unverified |
| 3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | | Unverified |
| 4 | QCNN-10L-256FM | Percentage error | 19.64 | | Unverified |
| 5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | | Unverified |
| 6 | Bi-RNN + Attention | Percentage error | 17.6 | | Unverified |
| 7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | | Unverified |
| 8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | | Unverified |
| 9 | Light Gated Recurrent Units | Percentage error | 16.7 | | Unverified |
| 10 | GRU | Percentage error | 16.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Att | Word Error Rate (WER) | 18.7 | | Unverified |
| 2 | CTC/Att | Word Error Rate (WER) | 6.7 | | Unverified |
| 3 | BRA-E | Word Error Rate (WER) | 6.63 | | Unverified |
| 4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | | Unverified |
| 5 | BAT | Word Error Rate (WER) | 4.97 | | Unverified |
| 6 | Paraformer | Word Error Rate (WER) | 4.95 | | Unverified |
| 7 | U2 | Word Error Rate (WER) | 4.72 | | Unverified |
| 8 | UMA | Word Error Rate (WER) | 4.7 | | Unverified |
| 9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | | Unverified |
| 10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | | Unverified |
| 2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | | Unverified |
| 3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | | Unverified |
| 4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | | Unverified |
| 5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | | Unverified |
| 6 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | | Unverified |
| 7 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | | Unverified |
| 8 | Espresso | Word Error Rate (WER) | 3.4 | | Unverified |
| 9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | | Unverified |
| 10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | | Unverified |
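
Most of the results above are reported as Word Error Rate (WER). As a rough illustration of what that number measures, the sketch below computes WER in the usual way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the listed papers may apply their own text normalization and scoring tools, so this is not a reproduction of any entry's evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Multiplying the returned fraction by 100 gives a percentage of the kind the tables appear to use. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.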