Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects its own sample of ratings from Amazon Mechanical Turk.
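To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score is commonly summarized: the average of listener ratings on a 1 to 5 scale, with an approximate 95% confidence interval. The function name `mean_opinion_score`, the normal-approximation interval, and both rating lists are illustrative assumptions, not the protocol of any study listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Uses a normal approximation (z = 1.96); individual studies may use
    other interval estimates, so treat this as a sketch only.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical 1-5 ratings of the same system from two listener pools.
study_a = [5, 4, 5, 4, 5, 4, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 5, 3]

mos_a, ci_a = mean_opinion_score(study_a)
mos_b, ci_b = mean_opinion_score(study_b)
print(f"study A: {mos_a:.2f} +/- {ci_a:.2f}")
print(f"study B: {mos_b:.2f} +/- {ci_b:.2f}")
```

In this made-up example the same system receives different scores purely because the listener pools differ, which is why MOS values from separate studies should be read as within-study rankings rather than as absolute measurements.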

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 26–50 of 1249 papers

Title	Date	Tasks	Status	Hype
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Mar 5, 2024	Quantization, Speech Synthesis	Code Available	3
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	Nov 21, 2023	Speech Synthesis, Super-Resolution	Code Available	3
Matcha-TTS: A fast TTS architecture with conditional flow matching	Sep 6, 2023	Acoustic Modelling, Decoder	Code Available	3
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech	Jul 13, 2022	Denoising, GPU	Code Available	3
BigVGAN: A Universal Neural Vocoder with Large-Scale Training	Jun 9, 2022	Audio Generation, Audio Synthesis	Code Available	3
Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model	May 11, 2022	Packet Loss Concealment, Speech Enhancement	Code Available	3
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation	Feb 23, 2022	Speech Synthesis	Code Available	3
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet	Feb 22, 2022	Speech Synthesis	Code Available	3
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation	Jun 15, 2021	Speech Synthesis, text-to-speech	Code Available	3
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning	Jul 9, 2019	Speech Synthesis, text-to-speech	Code Available	3
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching	Jun 20, 2025	Scheduling, Speech Synthesis	Code Available	2
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space	May 19, 2025	Language Modeling, Language Modelling	Code Available	2
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching	Mar 20, 2025	Speech Synthesis	Code Available	2
PodAgent: A Comprehensive Framework for Podcast Generation	Mar 1, 2025	Audio Generation, Speech Synthesis	Code Available	2
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis	Jan 8, 2025	Decoder, Emotional Speech Synthesis	Code Available	2
Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Oct 30, 2024	Speech Synthesis, text-to-speech	Code Available	2
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control	Oct 1, 2024	Emotional Speech Synthesis, Speech Synthesis	Code Available	2
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Sep 11, 2024	Decoder, Speech Synthesis	Code Available	2
Sample-Efficient Diffusion for Text-To-Speech Synthesis	Sep 1, 2024	Language Modeling, Language Modelling	Code Available	2
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description	Aug 24, 2024	Descriptive, Speech Synthesis	Code Available	2
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis	Jul 13, 2024	Mamba, speech-recognition	Code Available	2
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability	Jun 27, 2024	Speech Synthesis, text-to-speech	Code Available	2
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis	Jun 6, 2024	Decoder, Inductive Bias	Code Available	2
Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness	Apr 10, 2024	Speech Synthesis, text-to-speech	Code Available	2
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Mar 31, 2024	Denoising, Speech Synthesis	Code Available	2

Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013. The five tables under Benchmark Results appear to follow this order.

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified