Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects its ratings from a different sample of Amazon Mechanical Turk workers.
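To illustrate why scores from different listener pools are hard to compare, here is a minimal sketch of how a mean opinion score is typically summarized. It assumes ratings on a 1 to 5 scale and a normal-approximation 95% confidence interval; the function name and the two rating lists are hypothetical and do not come from any study listed here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of a normal-approximation 95%
    confidence interval around the mean (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Sample standard deviation (n - 1 denominator).
    stdev = statistics.stdev(ratings)
    half_width = 1.96 * stdev / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings of the same system from two different listener pools.
pool_a = [5, 4, 5, 4, 5, 4, 4, 5]
pool_b = [4, 3, 4, 4, 3, 5, 4, 3]

mos_a, ci_a = mean_opinion_score(pool_a)
mos_b, ci_b = mean_opinion_score(pool_b)
print(f"pool A: {mos_a:.2f} +/- {ci_a:.2f}")
print(f"pool B: {mos_b:.2f} +/- {ci_b:.2f}")
```

In this made-up example the same system receives noticeably different scores depending on who rated it, which is why a MOS gap between two papers may reflect the raters rather than the models.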

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers


Showing 851–900 of 1249 papers

Title	Date	Tasks	Status
Versatile Speech Databases for High Quality Synthesis for Basque	May 1, 2012	Emotional Speech Synthesis, Speech Synthesis	Unverified
Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French]	Jun 1, 2012	Speech Synthesis, text-to-speech	Unverified
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing	Nov 30, 2022	Machine Translation, Sentence	Unverified
Video-to-Video Translation for Visual Speech Synthesis	May 28, 2019	Image-to-Image Translation, Speech Synthesis	Unverified
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Oct 27, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection	Jun 15, 2022	feature selection, Speech Synthesis	Unverified
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Nov 26, 2024	Decoder, multimodal generation	Unverified
Visual-Aware Text-to-Speech	Jun 21, 2023	Rhythm, Speech Synthesis	Unverified
VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over	Oct 7, 2021	Speech Synthesis, text-to-speech	Unverified
VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders	Aug 13, 2024	Speech Synthesis	Unverified
Vocoder-Based Speech Synthesis from Silent Videos	Apr 6, 2020	Multi-Task Learning, Speech Synthesis	Unverified
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning	Feb 10, 2021	Speech Synthesis, text-to-speech	Unverified
Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer	Sep 3, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Voice Conversion for Whispered Speech Synthesis	Dec 11, 2019	Speech Synthesis, Voice Conversion	Unverified
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities	Apr 10, 2017	speech-recognition, Speech Recognition	Unverified
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models	Apr 3, 2025	Speech Synthesis	Unverified
VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis	Dec 26, 2024	Audio Generation, Speech Synthesis	Unverified
Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module	Feb 16, 2022	Speech Synthesis, text-to-speech	Unverified
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis	Mar 1, 2024	Speech Synthesis	Unverified
VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Sep 3, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks	Sep 14, 2023	Decoder, Language Modeling	Unverified
VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space	Nov 22, 2024	Audio Synthesis, Decoder	Unverified
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Apr 2, 2022	Speech Synthesis, text-to-speech	Unverified
Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder	Jul 31, 2018	Generative Adversarial Network, Speech Synthesis	Unverified
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation	Apr 5, 2019	Speech Synthesis	Unverified
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks	Sep 25, 2018	Speech Synthesis, Voice Conversion	Unverified
Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks	Oct 30, 2018	Image Generation, Speech Synthesis	Unverified
Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis	Mar 24, 2023	Generative Adversarial Network, Speech Synthesis	Unverified
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis	Mar 31, 2022	Speech Synthesis, text-to-speech	Unverified
Weakly-supervised text-to-speech alignment confidence measure	Dec 1, 2016	speech-recognition, Speech Recognition	Unverified
WebWOZ: A Platform for Designing and Conducting Web-based Wizard of Oz Experiments	Aug 1, 2013	Machine Translation, Speech Recognition	Unverified
We Need Variations in Speech Generation: Sub-center Modelling for Speaker Embeddings	Jul 5, 2024	Speaker Recognition, Speech Synthesis	Unverified
What happens to diffusion model likelihood when your model is conditional?	Sep 10, 2024	domain classification, model	Unverified
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS	Sep 4, 2020	Decoder, Sentence	Unverified
Which Prosodic Features Matter Most for Pragmatics?	Aug 23, 2024	Speech Synthesis	Unverified
Which Synthetic Voice Should I Choose for an Evocative Task?	Sep 1, 2015	Speech Synthesis, Text-To-Speech Synthesis	Unverified
Whispered and Lombard Neural Speech Synthesis	Jan 13, 2021	Speaker Verification, Speech Synthesis	Unverified
Whither the Priors for (Vocal) Interactivity?	Mar 16, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
WinkTalk: a demonstration of a multimodal speech synthesis platform linking facial expressions to expressive synthetic voices	Jun 1, 2012	Speech Synthesis	Unverified
WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis	Jun 20, 2022	CPU, Speech Synthesis	Unverified
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis	Nov 19, 2021	Expressive Speech Synthesis, Speech Synthesis	Unverified
應用文脈分析於中英夾雜語音合成系統 (Linguistic Analysis for English/Mandarin Speech Synthesis System)	Oct 1, 2019	Speech Synthesis	Unverified
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation	May 14, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention	Jan 25, 2022	Form, Speech Synthesis	Unverified
Zero-Shot Mono-to-Binaural Speech Synthesis	Dec 11, 2024	Audio Synthesis, Denoising	Unverified
Zero-shot personalized lip-to-speech synthesis with face image based voice control	May 9, 2023	Lip to Speech Synthesis, Representation Learning	Unverified
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	May 26, 2025	Sentence, Speech Synthesis	Unverified
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Apr 24, 2023	Rhythm, Self-Supervised Learning	Unverified
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	May 23, 2023	Speech Synthesis, text-to-speech	Unverified
整合語者嵌入向量與後置濾波器於提升個人化合成語音之語者相似度 (Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System)	Dec 1, 2021	Speech Synthesis	Unverified


Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified