Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as a metric, and each study collects its ratings on different samples from Amazon Mechanical Turk.
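To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score is typically computed: the arithmetic mean of listener ratings on a 1-5 scale, reported with an approximate 95% confidence interval. The function name, the normal-approximation interval, and both rating lists are illustrative assumptions and are not taken from any study listed on this page.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Uses a normal approximation; z=1.96 is an assumption that suits
    reasonably large rater pools, not a rule from any listed paper.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical 1-5 ratings of the same system from two separate listener pools.
study_a = [5, 4, 5, 4, 4, 5, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 5, 3]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, half_width = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

Because the score depends on who rated which samples, two studies can report noticeably different values for the same system, which is why rankings across papers should be read with caution.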

(Image credit: WaveNet: A generative model for raw audio)

Papers


Showing 476–500 of 1249 papers

Title	Date	Tasks	Status	Hype
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Nov 17, 2022	Speech Synthesis, text-to-speech	Unverified	0
The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement	Nov 14, 2022	Data Augmentation, Speech Enhancement	Unverified	0
OverFlow: Putting flows on top of neural transducers for better TTS	Nov 13, 2022	Normalising Flows, Speech Synthesis	Code Available	1
Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations	Nov 11, 2022	Emotional Speech Synthesis, Speech Synthesis	Unverified	0
PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping	Nov 8, 2022	Generative Adversarial Network, Speech Synthesis	Code Available	1
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Nov 7, 2022	Representation Learning, Speech Representation Learning	Code Available	6
Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder	Nov 7, 2022	Speech Synthesis, text-to-speech	Code Available	1
Deliberation Networks and How to Train Them	Nov 6, 2022	Machine Translation, Speech Synthesis	Unverified	0
Self-Supervised Learning for Speech Enhancement through Synthesis	Nov 4, 2022	Denoising, Self-Supervised Learning	Code Available	0
SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing	Nov 4, 2022	Diversity, Speaker Verification	Code Available	1
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis	Nov 2, 2022	Expressive Speech Synthesis, Speech Synthesis	Unverified	0
Taiwanese-Accented Mandarin and English Multi-Speaker Talking-Face Synthesis System	Nov 1, 2022	Face Generation, Speech Synthesis	Unverified	0
A Preliminary Study on Mandarin-Hakka neural machine translation using small-sized data	Nov 1, 2022	Machine Translation, Speech Synthesis	Unverified	0
Development of Mandarin-English code-switching speech synthesis system	Nov 1, 2022	Sentence, Speech Synthesis	Unverified	0
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages	Nov 1, 2022	Chunking, Rhythm	Unverified	0
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis	Nov 1, 2022	Disentanglement, Diversity	Unverified	0
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers	Nov 1, 2022	parameter-efficient fine-tuning, Speech Synthesis	Unverified	0
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments	Oct 31, 2022	Speech Synthesis	Unverified	0
Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis	Oct 28, 2022	Decoder, Diversity	Unverified	0
Evaluating context-invariance in unsupervised speech representations	Oct 27, 2022	Language Modelling, speech-recognition	Code Available	0
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis	Oct 27, 2022	Speech Synthesis, text-to-speech	Code Available	1
Articulation GAN: Unsupervised modeling of articulatory learning	Oct 27, 2022	Generative Adversarial Network, Speech Synthesis	Code Available	1
A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution	Oct 27, 2022	Speech Synthesis	Code Available	0
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Oct 27, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
RedPen: Region- and Reason-Annotated Dataset of Unnatural Speech	Oct 26, 2022	Speech Synthesis	Unverified	0


Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

Dataset: LibriTTS
#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

Dataset: North American English
#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified

Dataset: LJSpeech
#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

Dataset: Mandarin Chinese
#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

Dataset: Blizzard Challenge 2013
#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified