Speech Synthesis

Speech synthesis is the task of generating speech from some other modality like text, lip movements etc.

Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk.

( Image credit: WaveNet: A generative model for raw audio )

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 951–975 of 1249 papers

Title	Date	Tasks	Status
Whispered and Lombard Neural Speech Synthesis	Jan 13, 2021	Speaker VerificationSpeech Synthesis	—Unverified
Whither the Priors for (Vocal) Interactivity?	Mar 16, 2022	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
WinkTalk: a demonstration of a multimodal speech synthesis platform linking facial expressions to expressive synthetic voices	Jun 1, 2012	Speech Synthesis	—Unverified
WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis	Jun 20, 2022	CPUSpeech Synthesis	—Unverified
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis	Nov 19, 2021	Expressive Speech SynthesisSpeech Synthesis	—Unverified
應用文脈分析於中英夾雜語音合成系統(Linguistic Analysis for English/Mandarin Speech Synthesis System)	Oct 1, 2019	Speech Synthesis	—Unverified
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation	May 14, 2020	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention	Jan 25, 2022	FormSpeech Synthesis	—Unverified
Zero-Shot Mono-to-Binaural Speech Synthesis	Dec 11, 2024	Audio SynthesisDenoising	—Unverified
Zero-shot personalized lip-to-speech synthesis with face image based voice control	May 9, 2023	Lip to Speech SynthesisRepresentation Learning	—Unverified
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	May 26, 2025	SentenceSpeech Synthesis	—Unverified
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Apr 24, 2023	RhythmSelf-Supervised Learning	—Unverified
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	May 23, 2023	Speech Synthesistext-to-speech	—Unverified
整合語者嵌入向量與後置濾波器於提升個人化合成語音之語者相似度 (Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System)	Dec 1, 2021	Speech Synthesis	—Unverified
Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions	Jun 3, 2025	Expressive Speech SynthesisPrompt Learning	—Unverified
Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations	Jun 2, 2022	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis	Nov 19, 2021	ClusteringDecoder	—Unverified
Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis	Jun 29, 2020	SentenceSpeech Synthesis	—Unverified
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech	Nov 4, 2020	Graph AttentionRepresentation Learning	—Unverified
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesistext-to-speech	—Unverified
Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit	Aug 13, 2020	Language ModelingLanguage Modelling	—Unverified
Prosody-TTS: An end-to-end speech synthesis system with prosody control	Oct 6, 2021	RhythmSpeech Synthesis	—Unverified
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language ModelingLanguage Modelling	—Unverified
Punjabi Text-To-Speech Synthesis System	Dec 1, 2012	Speech Synthesistext-to-speech	—Unverified
PyDial: A Multi-domain Statistical Dialogue System Toolkit	Jul 1, 2017	Dialogue ManagementSpeech Recognition	—Unverified

Show:10 25 50

← PrevPage 39 of 50Next →

All datasets LibriTTS North American English LJSpeech Mandarin Chinese Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified