Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not really comparable between studies, as they use mean opinion score as a metric and collect different samples from Amazon Mechanical Turk.
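To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score is typically summarized: the average of listener ratings on a 1 to 5 scale, reported with a confidence interval. The function name `mean_opinion_score` and both rating lists are hypothetical and are not taken from any paper listed here; the interval uses a simple normal approximation.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: z * sample standard deviation / sqrt(n).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical 1-5 ratings for one system, gathered from two separate listener pools.
study_a = [5, 4, 5, 4, 5, 4, 5, 5]
study_b = [4, 3, 4, 4, 3, 4, 5, 3]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, half_width = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

In this made-up example the same system receives clearly different scores from the two pools, which is why a MOS value from one study cannot be ranked directly against a MOS value from another.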

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 901–950 of 1249 papers

Title	Date	Tasks	Status
Using NLG for speech synthesis of mathematical sentences	Oct 1, 2019	Sentence, Speech Synthesis	Unverified
Using previous acoustic context to improve Text-to-Speech synthesis	Dec 7, 2020	Decoder, Speech Synthesis	Unverified
Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech	Nov 28, 2019	Disentanglement, Expressive Speech Synthesis	Unverified
Utterance-level Sequential Modeling For Deep Gaussian Process Based Speech Synthesis Using Simple Recurrent Unit	Apr 22, 2020	Speech Synthesis	Unverified
Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE	Jun 6, 2022	Representation Learning, Speech Representation Learning	Unverified
UzbekTagger: The rule-based POS tagger for Uzbek language	Jan 30, 2023	Language Modeling, Language Modelling	Unverified
VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	May 21, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Jun 8, 2024	Speech Synthesis, text-to-speech	Unverified
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Jun 12, 2024	Quantization, Speech Synthesis	Unverified
VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation	Mar 14, 2023	Disentanglement, Speech Synthesis	Unverified
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention	Feb 12, 2021	Speech Synthesis, text-to-speech	Unverified
Variations prosodiques en synthèse par sélection d'unités: l'exemple des phrases interrogatives (Prosodic variations in unit-based speech synthesis: the example of interrogative sentences) [in French]	Jun 1, 2012	Speech Synthesis, Text-To-Speech Synthesis	Unverified
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion	Feb 18, 2022	Quantization, Speech Synthesis	Unverified
vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders	Sep 3, 2024	Speech Synthesis, Voice Conversion	Unverified
Versatile Speech Databases for High Quality Synthesis for Basque	May 1, 2012	Emotional Speech Synthesis, Speech Synthesis	Unverified
Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French]	Jun 1, 2012	Speech Synthesis, text-to-speech	Unverified
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing	Nov 30, 2022	Machine Translation, Sentence	Unverified
Video-to-Video Translation for Visual Speech Synthesis	May 28, 2019	Image-to-Image Translation, Speech Synthesis	Unverified
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Oct 27, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection	Jun 15, 2022	feature selection, Speech Synthesis	Unverified
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Nov 26, 2024	Decoder, multimodal generation	Unverified
Visual-Aware Text-to-Speech	Jun 21, 2023	Rhythm, Speech Synthesis	Unverified
VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over	Oct 7, 2021	Speech Synthesis, text-to-speech	Unverified
VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders	Aug 13, 2024	Speech Synthesis	Unverified
Vocoder-Based Speech Synthesis from Silent Videos	Apr 6, 2020	Multi-Task Learning, Speech Synthesis	Unverified
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning	Feb 10, 2021	Speech Synthesis, text-to-speech	Unverified
Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer	Sep 3, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Voice Conversion for Whispered Speech Synthesis	Dec 11, 2019	Speech Synthesis, Voice Conversion	Unverified
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities	Apr 10, 2017	speech-recognition, Speech Recognition	Unverified
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models	Apr 3, 2025	Speech Synthesis	Unverified
VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis	Dec 26, 2024	Audio Generation, Speech Synthesis	Unverified
Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module	Feb 16, 2022	Speech Synthesis, text-to-speech	Unverified
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis	Mar 1, 2024	Speech Synthesis	Unverified
VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka	Sep 3, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks	Sep 14, 2023	Decoder, Language Modeling	Unverified
VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space	Nov 22, 2024	Audio Synthesis, Decoder	Unverified
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Apr 2, 2022	Speech Synthesis, text-to-speech	Unverified
Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder	Jul 31, 2018	Generative Adversarial Network, Speech Synthesis	Unverified
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation	Apr 5, 2019	Speech Synthesis	Unverified
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks	Sep 25, 2018	Speech Synthesis, Voice Conversion	Unverified
Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks	Oct 30, 2018	Image Generation, Speech Synthesis	Unverified
Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis	Mar 24, 2023	Generative Adversarial Network, Speech Synthesis	Unverified
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis	Mar 31, 2022	Speech Synthesis, text-to-speech	Unverified
Weakly-supervised text-to-speech alignment confidence measure	Dec 1, 2016	speech-recognition, Speech Recognition	Unverified
WebWOZ: A Platform for Designing and Conducting Web-based Wizard of Oz Experiments	Aug 1, 2013	Machine Translation, Speech Recognition	Unverified
We Need Variations in Speech Generation: Sub-center Modelling for Speaker Embeddings	Jul 5, 2024	Speaker Recognition, Speech Synthesis	Unverified
What happens to diffusion model likelihood when your model is conditional?	Sep 10, 2024	domain classification, model	Unverified
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS	Sep 4, 2020	Decoder, Sentence	Unverified
Which Prosodic Features Matter Most for Pragmatics?	Aug 23, 2024	Speech Synthesis	Unverified
Which Synthetic Voice Should I Choose for an Evocative Task?	Sep 1, 2015	Speech Synthesis, Text-To-Speech Synthesis	Unverified


Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified