Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies, because they use mean opinion score (MOS) as a metric and each study collects its own sample of ratings from Amazon Mechanical Turk.
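As a rough illustration of why MOS figures depend on who did the rating, the sketch below computes a mean opinion score and a normal-approximation 95% confidence interval from two sets of listener ratings. The function name `mean_opinion_score` and both rating lists are hypothetical and do not come from any study listed here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is a normal-approximation 95% confidence interval:
    1.96 * sample standard deviation / sqrt(n).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings of the same system from two different listener pools.
study_a = [5, 4, 5, 4, 5, 4, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 5, 3]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

With these made-up ratings the two pools give different averages for the same audio, which is the reason a MOS from one paper should not be ranked against a MOS from another without a shared listening test.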

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 601–650 of 1249 papers

Title	Date	Tasks	Status
POS-Tag Based Poetry Generation with WordNet	Aug 1, 2013	POS, Speech Synthesis	Unverified
Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems	May 1, 2012	Speech Synthesis	Unverified
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis	Aug 4, 2018	Speech Synthesis, text-to-speech	Unverified
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis	Nov 2, 2022	Expressive Speech Synthesis, Speech Synthesis	Unverified
Predicting Phrase Breaks in Classical and Modern Standard Arabic Text	May 1, 2012	Chunking, Human Parsing	Unverified
Predicting Romanian Stress Assignment	Apr 1, 2014	Speech Synthesis, Text-To-Speech Synthesis	Unverified
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance	Jun 25, 2021	Quantization, Speaker anonymization	Unverified
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis	Nov 10, 2020	Speech Synthesis	Unverified
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior	Jun 11, 2021	Audio Generation, Denoising	Unverified
Privacy-oriented manipulation of speaker representations	Oct 10, 2023	Speaker Recognition, Speech Synthesis	Unverified
Probabilistic Dialogue Models with Prior Domain Knowledge	Jul 1, 2012	Dialogue Management, Semantic Parsing	Unverified
Probing Speaker-specific Features in Speaker Representations	Jan 9, 2025	Self-Supervised Learning, Speaker Verification	Unverified
Probing the Feasibility of Multilingual Speaker Anonymization	Jul 3, 2024	Speaker anonymization, Speech Synthesis	Unverified
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Jan 10, 2025	Speech Synthesis, text-to-speech	Unverified
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders	Apr 3, 2024	Representation Learning, Speaker Verification	Unverified
Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions	Jun 3, 2025	Expressive Speech Synthesis, Prompt Learning	Unverified
Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations	Jun 2, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis	Nov 19, 2021	Clustering, Decoder	Unverified
Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis	Jun 29, 2020	Sentence, Speech Synthesis	Unverified
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech	Nov 4, 2020	Graph Attention, Representation Learning	Unverified
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesis, text-to-speech	Unverified
Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit	Aug 13, 2020	Language Modeling, Language Modelling	Unverified
Prosody-TTS: An end-to-end speech synthesis system with prosody control	Oct 6, 2021	Rhythm, Speech Synthesis	Unverified
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language Modeling, Language Modelling	Unverified
Punjabi Text-To-Speech Synthesis System	Dec 1, 2012	Speech Synthesis, text-to-speech	Unverified
PyDial: A Multi-domain Statistical Dialogue System Toolkit	Jul 1, 2017	Dialogue Management, Speech Recognition	Unverified
pyiwn: A Python based API to access Indian Language WordNets	Jan 1, 2018	Speech Synthesis	Unverified
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis	Mar 14, 2023	Emotional Speech Synthesis, Sentence	Unverified
Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective	Sep 29, 2024	Audio-Visual Speech Recognition, Lip Reading	Unverified
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Apr 4, 2024	Language Modeling, Language Modelling	Unverified
RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations	May 24, 2025	Expressive Speech Synthesis, Speech Synthesis	Unverified
Real-time Incremental Speech-to-Speech Translation of Dialogs	Jun 1, 2012	Machine Translation, Speech Recognition	Unverified
Real-Time Single-Speaker Taiwanese-Accented Mandarin Speech Synthesis System	Sep 1, 2020	Speech Synthesis	Unverified
ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	May 9, 2022	Speech Synthesis, text-to-speech	Unverified
Recurrent Neural Network Postfilters for Statistical Parametric Speech Synthesis	Jan 26, 2016	General Classification, regression	Unverified
RedPen: Region- and Reason-Annotated Dataset of Unnatural Speech	Oct 26, 2022	Speech Synthesis	Unverified
Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis	Sep 8, 2021	Expressive Speech Synthesis, Sentence	Unverified
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images	Sep 1, 2017	Referring Expression, Referring expression generation	Unverified
Regeneration Learning: A Learning Paradigm for Data Generation	Jan 21, 2023	Image Generation, Representation Learning	Unverified
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability	Apr 3, 2021	Emotion Recognition, reinforcement-learning	Unverified
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models	May 23, 2024	Image Generation, reinforcement-learning	Unverified
Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling	Apr 1, 2024	Speaker Identification, Speech Synthesis	Unverified
Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge	Oct 29, 2019	Speaker Verification, Speech Synthesis	Unverified
Residual-guided Personalized Speech Synthesis based on Face Image	Apr 1, 2022	Speech Synthesis	Unverified
Response Generation Based on Hierarchical Semantic Structure with POMDP Re-ranking for Conversational Dialogue Systems	Oct 1, 2013	Dialogue Management, Information Retrieval	Unverified
Retrieval-Augmented Audio Deepfake Detection	Apr 22, 2024	Audio Deepfake Detection, DeepFake Detection	Unverified
Review of end-to-end speech synthesis technology based on deep learning	Apr 20, 2021	Speech Synthesis	Unverified
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Dec 21, 2022	Audio-Visual Speech Recognition, Resynthesis	Unverified
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration	Jan 1, 2023	Audio-Visual Speech Recognition, Resynthesis	Unverified
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	May 25, 2025	Speech Synthesis, text-to-speech	Unverified

Datasets (the tables under Benchmark Results follow this order): LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified