Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies: each study uses mean opinion score (MOS) as its metric and collects ratings on different samples from Amazon Mechanical Turk.
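To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score is typically summarized: the average of 1-5 listener ratings, with a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and both rating lists are hypothetical and are not taken from any paper listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the normal-approximation confidence half-width
    z * s / sqrt(n), where s is the sample standard deviation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings of the same system from two different listener pools.
study_a = [5, 4, 5, 4, 4, 5, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 3, 4]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {hw:.2f}")
```

In this made-up example the same system gets a noticeably different score from each pool, which is why MOS values collected by different studies should be read as rough indications rather than a strict ranking.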

(Image credit: WaveNet: A generative model for raw audio)

Papers

Showing 1–25 of 1249 papers

Title	Date	Tasks	Status	Hype
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Jul 7, 2024	Language Modelling, Large Language Model	Code Available	11
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	May 23, 2025	Automatic Speech Recognition, Emotion Recognition	Code Available	11
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models	Dec 13, 2024	In-Context Learning, Quantization	Code Available	11
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	Aug 29, 2024	Speech Synthesis	Code Available	7
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Jan 5, 2023	In-Context Learning, Language Modeling	Code Available	7
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model	Jun 10, 2025	Language Modeling, Language Modelling	Code Available	7
Better speech synthesis through scaling	May 12, 2023	Image Generation, Speech Synthesis	Code Available	6
PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit	May 20, 2022	All, Automatic Speech Recognition (ASR)	Code Available	6
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Nov 7, 2022	Representation Learning, Speech Representation Learning	Code Available	6
Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction	May 31, 2024	Speech Synthesis	Code Available	5
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Jun 5, 2024	Automatic Speech Recognition (ASR), de-en	Code Available	5
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Mar 7, 2023	In-Context Learning, Language Modeling	Code Available	5
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models	Jun 13, 2023	Speech Synthesis, text-to-speech	Code Available	5
Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert	Apr 18, 2023	Audio Generation, Expressive Speech Synthesis	Code Available	4
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis	Feb 6, 2025	Speech Synthesis	Code Available	4
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis	Jun 1, 2023	Audio Synthesis, Computational Efficiency	Code Available	4
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Jun 16, 2025	Decoder, Speech Synthesis	Code Available	4
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation	Feb 23, 2022	Speech Synthesis	Code Available	3
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet	Feb 22, 2022	Speech Synthesis	Code Available	3
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Mar 5, 2024	Quantization, Speech Synthesis	Code Available	3
MoonCast: High-Quality Zero-Shot Podcast Generation	Mar 18, 2025	Speech Synthesis, text-to-speech	Code Available	3
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis	May 5, 2025	Chatbot, Decoder	Code Available	3
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control	Jun 3, 2024	Speech Synthesis, text-to-speech	Code Available	3
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	Aug 30, 2024	Audio Compression, Audio Generation	Code Available	3
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning	Jul 9, 2019	Speech Synthesis, text-to-speech	Code Available	3

Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified