SOTAVerified

Speech Recognition

Speech recognition is the task of converting spoken language into text: recognizing the words in an audio signal and transcribing them in written form. The goal is accurate transcription, either in real time or from recorded audio, that stays robust to accents, speaking speed, and background noise.


Papers

Showing 201-250 of 6433 papers

Title | Status | Hype
D4AM: A General Denoising Framework for Downstream Acoustic Models | Code | 1
Do VSR Models Generalize Beyond LRS3? | Code | 1
Zero-shot audio captioning with audio-language model guidance and audio context keywords | Code | 1
Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation | Code | 1
GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition | Code | 1
Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning | Code | 1
Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts | Code | 1
Automatic Disfluency Detection from Untranscribed Speech | Code | 1
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation | Code | 1
Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong's Polylingual Dynamics | Code | 1
CL-MASR: A Continual Learning Benchmark for Multilingual ASR | Code | 1
ArTST: Arabic Text and Speech Transformer | Code | 1
Accented Speech Recognition With Accent-specific Codebooks | Code | 1
How Much Context Does My Attention-Based ASR System Need? | Code | 1
Advancing Test-Time Adaptation in Wild Acoustic Test Settings | Code | 1
Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching | Code | 1
Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech | Code | 1
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation | Code | 1
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models | Code | 1
Speech collage: code-switched audio generation by collaging monolingual corpora | Code | 1
Updated Corpora and Benchmarks for Long-Form Speech Recognition | Code | 1
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data | Code | 1
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning | Code | 1
Memory-augmented conformer for improved end-to-end long-form ASR | Code | 1
Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition | Code | 1
Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring | Code | 1
HypR: A comprehensive study for ASR hypothesis revising with a reference corpus | Code | 1
DiaCorrect: Error Correction Back-end For Speaker Diarization | Code | 1
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper | Code | 1
Unimodal Aggregation for CTC-based Speech Recognition | Code | 1
EnCodecMAE: Leveraging neural codecs for universal audio representation learning | Code | 1
DiariST: Streaming Speech Translation with Speaker Diarization | Code | 1
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | Code | 1
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Code | 1
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | Code | 1
ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus | Code | 1
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures | Code | 1
Adaptation of Whisper models to child speech recognition | Code | 1
Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning | Code | 1
ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development | Code | 1
Towards Stealthy Backdoor Attacks against Speech Recognition via Elements of Sound | Code | 1
Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion | Code | 1
Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings | Code | 1
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT | Code | 1
NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning | Code | 1
A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision | Code | 1
Quilt-1M: One Million Image-Text Pairs for Histopathology | Code | 1
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition | Code | 1
DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model | Code | 1
STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | AmNet | Word Error Rate (WER) | 8.6 | | Unverified
2 | HMM-(SAT)GMM | Word Error Rate (WER) | 8 | | Unverified
3 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 7.19 | | Unverified
4 | Snips | Word Error Rate (WER) | 6.4 | | Unverified
5 | Li-GRU | Word Error Rate (WER) | 6.2 | | Unverified
6 | HMM-DNN + pNorm* | Word Error Rate (WER) | 5.5 | | Unverified
7 | CTC + policy learning | Word Error Rate (WER) | 5.42 | | Unverified
8 | Deep Speech 2 | Word Error Rate (WER) | 5.33 | | Unverified
9 | HMM-TDNN + iVectors | Word Error Rate (WER) | 4.8 | | Unverified
10 | Gated ConvNets | Word Error Rate (WER) | 4.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Local Prior Matching (Large Model) | Word Error Rate (WER) | 20.84 | | Unverified
2 | Snips | Word Error Rate (WER) | 16.5 | | Unverified
3 | Local Prior Matching (Large Model, ConvLM LM) | Word Error Rate (WER) | 15.28 | | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 13.25 | | Unverified
5 | TDNN + pNorm + speed up/down speech | Word Error Rate (WER) | 12.5 | | Unverified
6 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 10.65 | | Unverified
7 | Convolutional Speech Recognition | Word Error Rate (WER) | 10.47 | | Unverified
8 | MT4SSL | Word Error Rate (WER) | 9.6 | | Unverified
9 | Jasper DR 10x5 | Word Error Rate (WER) | 8.79 | | Unverified
10 | Espresso | Word Error Rate (WER) | 8.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Deep Speech | Percentage error | 20 | | Unverified
2 | DNN-HMM | Percentage error | 18.5 | | Unverified
3 | CD-DNN | Percentage error | 16.1 | | Unverified
4 | DNN | Percentage error | 16 | | Unverified
5 | DNN + Dropout | Percentage error | 15 | | Unverified
6 | DNN BMMI | Percentage error | 12.9 | | Unverified
7 | DNN MPE | Percentage error | 12.9 | | Unverified
8 | DNN MMI | Percentage error | 12.9 | | Unverified
9 | HMM-TDNN + pNorm + speed up/down speech | Percentage error | 12.9 | | Unverified
10 | HMM-DNN +sMBR | Percentage error | 12.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LSNN | Percentage error | 33.2 | | Unverified
2 | LAS multitask with indicators sampling | Percentage error | 20.4 | | Unverified
3 | Soft Monotonic Attention (ours, offline) | Percentage error | 20.1 | | Unverified
4 | QCNN-10L-256FM | Percentage error | 19.64 | | Unverified
5 | Bi-LSTM + skip connections w/ CTC | Percentage error | 17.7 | | Unverified
6 | Bi-RNN + Attention | Percentage error | 17.6 | | Unverified
7 | RNN-CRF on 24(x3) MFSC | Percentage error | 17.3 | | Unverified
8 | CNN in time and frequency + dropout, 17.6% w/o dropout | Percentage error | 16.7 | | Unverified
9 | Light Gated Recurrent Units | Percentage error | 16.7 | | Unverified
10 | GRU | Percentage error | 16.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Att | Word Error Rate (WER) | 18.7 | | Unverified
2 | CTC/Att | Word Error Rate (WER) | 6.7 | | Unverified
3 | BRA-E | Word Error Rate (WER) | 6.63 | | Unverified
4 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 6.34 | | Unverified
5 | BAT | Word Error Rate (WER) | 4.97 | | Unverified
6 | Paraformer | Word Error Rate (WER) | 4.95 | | Unverified
7 | U2 | Word Error Rate (WER) | 4.72 | | Unverified
8 | UMA | Word Error Rate (WER) | 4.7 | | Unverified
9 | Lightweight Transducer | Word Error Rate (WER) | 4.31 | | Unverified
10 | CIF-HKD With LM | Word Error Rate (WER) | 4.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Jasper 10x3 | Word Error Rate (WER) | 6.9 | | Unverified
2 | CNN over RAW speech (wav) | Word Error Rate (WER) | 5.6 | | Unverified
3 | CTC-CRF 4gram-LM | Word Error Rate (WER) | 3.79 | | Unverified
4 | Deep Speech 2 | Word Error Rate (WER) | 3.6 | | Unverified
5 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* | Word Error Rate (WER) | 3.6 | | Unverified
6 | Convolutional Speech Recognition | Word Error Rate (WER) | 3.5 | | Unverified
7 | TC-DNN-BLSTM-DNN | Word Error Rate (WER) | 3.5 | | Unverified
8 | Espresso | Word Error Rate (WER) | 3.4 | | Unverified
9 | CTC-CRF VGG-BLSTM | Word Error Rate (WER) | 3.2 | | Unverified
10 | Transformer with Relaxed Attention | Word Error Rate (WER) | 3.19 | | Unverified
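
For readers unfamiliar with the Word Error Rate (WER) metric used in the tables above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; individual benchmarks may normalize text differently (casing, punctuation, character-level scoring), and the claimed values above appear to be percentages, which this page does not confirm.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real benchmark scoring scripts usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(round(100 * word_error_rate("the cat sat on the mat",
                                  "the cat sat on mat"), 2))  # 16.67
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.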