Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies, because they use mean opinion score as the metric and each study collects a different set of samples from Amazon Mechanical Turk.
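To illustrate why mean opinion scores depend on the listener pool, here is a minimal Python sketch that averages 1-to-5 ratings and attaches an approximate 95% confidence interval. The function name `mean_opinion_score`, the normal-approximation interval, and the two rating lists are assumptions made for illustration; none of them come from the studies listed on this page.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings on the usual 1-5 absolute category scale and uses a
    normal approximation for the interval (an illustrative choice).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings of the same system from two different listener pools.
study_a = [5, 4, 5, 4, 4, 5, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 5, 3]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

In this made-up example the same system would receive noticeably different scores from the two pools, which is the kind of gap that makes cross-study rankings unreliable.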

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 451–475 of 1249 papers

Title	Date	Tasks	Status	Hype
On granularity of prosodic representations in expressive text-to-speech	Jan 26, 2023	Expressive Speech Synthesis, Speech Synthesis	Unverified	0
Multilingual Multiaccented Multispeaker TTS with RADTTS	Jan 24, 2023	Speech Synthesis	Unverified	0
Regeneration Learning: A Learning Paradigm for Data Generation	Jan 21, 2023	Image Generation, Representation Learning	Unverified	0
Applying Automated Machine Translation to Educational Video Courses	Jan 9, 2023	Machine Translation, Speech Synthesis	Unverified	0
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Jan 5, 2023	In-Context Learning, Language Modeling	Code Available	7
Towards Voice Reconstruction from EEG during Imagined Speech	Jan 2, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration	Jan 1, 2023	Audio-Visual Speech Recognition, Resynthesis	Unverified	0
HMM-based data augmentation for E2E systems for building conversational speech synthesis systems	Dec 22, 2022	Data Augmentation, Language Modeling	Unverified	0
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Dec 21, 2022	Audio-Visual Speech Recognition, Resynthesis	Unverified	0
Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder	Dec 16, 2022	Representation Learning, Speech Synthesis	Unverified	0
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Dec 16, 2022	Language Modeling, Language Modelling	Unverified	0
RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis	Dec 15, 2022	Relation, Speech Synthesis	Code Available	1
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis	Dec 13, 2022	Data Augmentation, Speech Synthesis	Unverified	0
MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset	Dec 11, 2022	Speech Synthesis, text-to-speech	Code Available	1
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing	Nov 30, 2022	Machine Translation, Sentence	Unverified	0
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech	Nov 30, 2022	Speech Synthesis, text-to-speech	Unverified	0
Controllable speech synthesis by learning discrete phoneme-level prosodic representations	Nov 29, 2022	Clustering, Speech Synthesis	Unverified	0
Contextual Expressive Text-to-Speech	Nov 26, 2022	Speech Synthesis, text-to-speech	Unverified	0
Efficient Incremental Text-to-Speech on GPUs	Nov 25, 2022	GPU, Speech Synthesis	Unverified	0
PromptTTS: Controllable Text-to-Speech with Text Descriptions	Nov 22, 2022	Decoder, Speech Synthesis	Code Available	0
Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System	Nov 21, 2022	GPU, Speech Synthesis	Code Available	1
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders	Nov 20, 2022	Speech Enhancement, Speech Synthesis	Unverified	0
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling	Nov 19, 2022	Expressive Speech Synthesis, Speech Synthesis	Unverified	0
Audio Anti-spoofing Using a Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning	Nov 17, 2022	Binary Classification, Meta-Learning	Unverified	0
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Nov 17, 2022	Speech Synthesis, text-to-speech	Unverified	0

Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified