Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Note that the leaderboard results here are not directly comparable between studies: most report mean opinion score (MOS), and each study collects its ratings from a different sample of listeners on Amazon Mechanical Turk.
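
A mean opinion score is the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below is illustrative and not taken from any of the listed papers: it computes a MOS with an approximate 95% confidence interval. The function name `mos_with_ci`, the normal approximation, and both rating pools are hypothetical assumptions. It shows how one system can receive noticeably different scores from two rater pools, which is one reason MOS values from separate studies should not be ranked against each other.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 opinion scores.

    Hypothetical helper. Uses a normal approximation to the sampling
    distribution of the mean, a simplifying assumption for small rater pools.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 rating scale")
    mean = statistics.mean(ratings)
    # Sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(ratings)
    # z * standard error gives the half-width of the interval.
    half_width = z * sd / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings of the same system from two different rater pools.
pool_a = [5, 4, 4, 5, 4, 3, 5, 4, 4, 5]
pool_b = [4, 3, 4, 4, 3, 3, 4, 4, 3, 4]

for name, pool in (("pool A", pool_a), ("pool B", pool_b)):
    mean, hw = mos_with_ci(pool)
    print(f"{name}: MOS = {mean:.2f} +/- {hw:.2f}")
```

With only ten ratings per pool the intervals are wide (roughly ±0.42 and ±0.32 here), so even within a single study small MOS gaps may not be meaningful.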

(Image credit: WaveNet: A generative model for raw audio)

Papers

Showing 701–750 of 1249 papers

Title	Date	Tasks	Status
Speaker Anonymization with Phonetic Intermediate Representations	Jul 11, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Speaker-independent neural formant synthesis	Jun 2, 2023	Speech Synthesis	Unverified
Speaker-independent raw waveform model for glottal excitation	Apr 25, 2018	model, Speech Synthesis	Unverified
Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models	May 15, 2019	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis	Jun 3, 2021	Data Augmentation, Speaker Verification	Unverified
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model	Feb 16, 2024	Denoising, Speech Enhancement	Unverified
Speaking rate attention-based duration prediction for speed control TTS	Oct 13, 2023	Attribute, Speech Synthesis	Unverified
Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention	Oct 29, 2018	Speech Synthesis, text-to-speech	Unverified
Speak While You Think: Streaming Speech Synthesis During Text Generation	Sep 20, 2023	Speech Synthesis, Text Generation	Unverified
SPEAK WITH YOUR HANDS Using Continuous Hand Gestures to control Articulatory Speech Synthesizer	Feb 2, 2021	Speech Synthesis	Unverified
SPEAK YOUR MIND! Towards Imagined Speech Recognition With Hierarchical Deep Learning	Apr 8, 2019	Brain Computer Interface, General Classification	Unverified
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis	Jan 30, 2024	Generative Adversarial Network, Speech Synthesis	Unverified
Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs	Jun 7, 2024	Quantization, Speech Synthesis	Unverified
Speech: A Challenge to Digital Signal Processing Technology for Human-to-Computer Interaction	May 8, 2013	Speech Synthesis, Speech-to-Text	Unverified
Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Jul 26, 2024	Generative Adversarial Network, Speech Enhancement	Unverified
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition	Jan 31, 2024	Decoder, Language Modeling	Unverified
Speech denoising by parametric resynthesis	Apr 2, 2019	Denoising, Resynthesis	Unverified
Speech earthquakes: scaling and universality in human voice	Aug 5, 2014	Speech Synthesis	Unverified
Speech inpainting: Context-based speech synthesis guided by video	Jun 1, 2023	speech-recognition, Speech Recognition	Unverified
Speech-MLP: a simple MLP architecture for speech processing	Sep 29, 2021	Keyword Spotting, Speech Enhancement	Unverified
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis	Jul 8, 2025	Data Augmentation, Mixture-of-Experts	Unverified
Speech Recognition with Augmented Synthesized Speech	Sep 25, 2019	Data Augmentation, Diversity	Unverified
Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis	Feb 11, 2024	Rhythm, Speaker Identification	Unverified
Speech Synthesis along Perceptual Voice Quality Dimensions	Jan 15, 2025	Expressive Speech Synthesis, Speech Synthesis	Unverified
Speech Synthesis as Augmentation for Low-Resource ASR	Dec 23, 2020	Data Augmentation, speech-recognition	Unverified
Speech Synthesis for Low Resource Languages using Transliteration Enabled Transfer Learning	Nov 16, 2021	speech-recognition, Speech Recognition	Unverified
Speech Synthesis of Code-Mixed Text	May 1, 2016	Language Identification, Speech Synthesis	Unverified
Speech Synthesis using EEG	Feb 22, 2020	EEG, Electroencephalogram (EEG)	Unverified
Speech Synthesis with Mixed Emotions	Aug 11, 2022	Attribute, Emotional Speech Synthesis	Unverified
Speech vocoding for laboratory phonology	Jan 22, 2016	Speech Synthesis, text-to-speech	Unverified
Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers	Sep 18, 2023	Denoising, Speech Synthesis	Unverified
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Jul 18, 2024	Language Modeling, Language Modelling	Unverified
Sprachsynthese -- State-of-the-Art in englischer und deutscher Sprache [Speech Synthesis: State of the Art in English and German]	Jun 11, 2021	Speech Synthesis	Unverified
Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Dec 28, 2024	Speech Synthesis, text-to-speech	Unverified
Statistical Evaluation of Pronunciation Encoding	May 1, 2012	Speech Recognition, Speech Synthesis	Unverified
Statistical Parametric Speech Synthesis Using Bottleneck Representation From Sequence Auto-encoder	Jun 19, 2016	Speech Synthesis	Unverified
Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection	Jun 21, 2023	Automatic Speech Recognition, speech-recognition	Unverified
StreamVC: Real-Time Low-Latency Voice Conversion	Jan 5, 2024	Speech Synthesis, Voice Conversion	Unverified
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Sep 24, 2024	Speech Synthesis, text-to-speech	Unverified
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis	Dec 13, 2022	Data Augmentation, Speech Synthesis	Unverified
Style Mixture of Experts for Expressive Text-To-Speech Synthesis	Jun 5, 2024	Mixture-of-Experts, Speech Synthesis	Unverified
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech	Mar 17, 2021	Speech Synthesis, Style Transfer	Unverified
StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis	Dec 19, 2023	Decoder, Speech Synthesis	Unverified
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Sep 16, 2024	Speech Synthesis, text-to-speech	Unverified
Style Variation as a Vantage Point for Code-Switching	May 1, 2020	Language Modeling, Language Modelling	Unverified
SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System	Mar 29, 2025	Speech Synthesis, text-to-speech	Unverified
SUT System Description for Anti-Spoofing 2017 Challenge	Nov 1, 2017	Quantization, Speaker Verification	Unverified
SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German	Mar 21, 2021	Speech Synthesis	Unverified
Syllabification by Phone Categorization	Jul 15, 2018	Retrieval, speech-recognition	Unverified
SynCLR: A Synthesis Framework for Contrastive Learning of out-of-domain Speech Representations	Sep 29, 2021	Contrastive Learning, Data Augmentation	Unverified

Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified