Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Note that the leaderboards here are not directly comparable between studies: they use mean opinion score (MOS) as the metric, and each study collects its ratings from a different sample of Amazon Mechanical Turk raters.
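To make the comparability caveat concrete: a MOS is the sample mean of listener ratings, so both its value and its uncertainty depend on which listeners and stimuli were sampled. The following is a minimal sketch that assumes ratings on a 1 to 5 scale and uses a normal-approximation 95% confidence interval. The helper `mos_with_ci` and both rating panels are hypothetical and only for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and a normal-approximation confidence interval.

    Hypothetical helper: assumes ratings on a 1-5 absolute category rating scale
    and uses z = 1.96 for an approximate 95% interval.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, estimated from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - z * sem, mean + z * sem)


# Two made-up listener panels rating the same system.
panel_a = [4, 5, 4, 4, 5, 3, 4, 5]
panel_b = [3, 4, 4, 3, 4, 4, 3, 5]
for name, ratings in (("panel A", panel_a), ("panel B", panel_b)):
    mos, (low, high) = mos_with_ci(ratings)
    print(f"{name}: MOS = {mos:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Here the two panels give MOS values of 4.25 and 3.75 for the same system. This is why a MOS reported by one study cannot simply be ranked against a MOS from another study that used different raters.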

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 801–850 of 1249 papers

Title	Date	Tasks	Status
Combining Manual and Automatic Prosodic Annotation for Expressive Speech Synthesis	May 1, 2016	Expressive Speech Synthesis, Speech Synthesis	Unverified
Compact Neural TTS Voices for Accessibility	Jan 28, 2025	Speech Synthesis, text-to-speech	Unverified
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Jul 31, 2023	Acoustic Modelling, Speech Synthesis	Unverified
Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora	May 1, 2012	Descriptive, Speech Recognition	Unverified
Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data	Nov 4, 2024	Speech Synthesis	Unverified
Computer-Aided Quality Assurance of an Icelandic Pronunciation Dictionary	May 1, 2014	speech-recognition, Speech Recognition	Unverified
Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need	Jul 2, 2022	All, Speech Synthesis	Unverified
CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis	Dec 16, 2023	Contrastive Learning, Self-Supervised Learning	Unverified
Conditional Spoken Digit Generation with StyleGAN	Sep 15, 2020	Image Generation, Speech Synthesis	Unverified
Conditioning Sequence-to-sequence Networks with Learned Activations	Sep 29, 2021	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Construction of English-French Multimodal Affective Conversational Corpus from TV Dramas	May 1, 2018	Emotion Recognition, Speech Recognition	Unverified
Constructive Interaction for Talking about Interesting Topics	May 1, 2012	Management, Speech Recognition	Unverified
Contextual Expressive Text-to-Speech	Nov 26, 2022	Speech Synthesis, text-to-speech	Unverified
Continual Speaker Adaptation for Text-to-Speech Synthesis	Mar 26, 2021	Continual Learning, Diversity	Unverified
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis	Feb 3, 2025	Quantization, Speech Synthesis	Unverified
Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations	Nov 11, 2022	Emotional Speech Synthesis, Speech Synthesis	Unverified
Continuous Expressive Speaking Styles Synthesis based on CVSM and MR-HMM	Dec 1, 2016	Expressive Speech Synthesis, Speech Recognition	Unverified
Continuous Speech Synthesis using per-token Latent Diffusion	Oct 21, 2024	Image Generation, Quantization	Unverified
Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis	Jun 12, 2021	Speech Synthesis	Unverified
Controllable Accented Text-to-Speech Synthesis	Sep 22, 2022	Speech Synthesis, text-to-speech	Unverified
Controllable Context-aware Conversational Speech Synthesis	Jun 21, 2021	Speech Synthesis	Unverified
Controllable Data Generation by Deep Learning: A Review	Jul 19, 2022	Deep Learning, Speech Synthesis	Unverified
Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions	Oct 26, 2023	Speech Synthesis	Unverified
Controllable Neural Prosody Synthesis	Aug 7, 2020	Speech Synthesis	Unverified
Controllable neural text-to-speech synthesis using intuitive prosodic features	Sep 14, 2020	Sentence, Speech Synthesis	Unverified
Controllable Sequence-To-Sequence Neural TTS with LPCNET Backend for Real-time Speech Synthesis on CPU	Feb 25, 2020	CPU, Prosody Prediction	Unverified
Controllable speech synthesis by learning discrete phoneme-level prosodic representations	Nov 29, 2022	Clustering, Speech Synthesis	Unverified
Controllable Prosody Generation With Partial Inputs	Mar 14, 2023	Speech Synthesis, text-to-speech	Unverified
Corpus Generation for Voice Command in Smart Home and the Effect of Speech Synthesis on End-to-End SLU	May 1, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models	Sep 18, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Visual-speech Synthesis of Exaggerated Corrective Feedback	Sep 12, 2020	Speech Synthesis	Unverified
Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models	Jun 1, 2025	counterfactual, Speech Synthesis	Unverified
CPJD Corpus: Crowdsourced Parallel Speech Corpus of Japanese Dialects	May 1, 2018	Machine Translation, Speech Recognition	Unverified
Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform	Dec 13, 2017	Speech Synthesis, text-to-speech	Unverified
Creating New Voices using Normalizing Flows	Dec 22, 2023	Speech Synthesis, text-to-speech	Unverified
Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models	May 27, 2023	Speech Synthesis, Voice Conversion	Unverified
Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech	Sep 15, 2023	Knowledge Distillation, Speech Synthesis	Unverified
Cross-lingual Low Resource Speaker Adaptation Using Phonological Features	Nov 17, 2021	Speech Synthesis	Unverified
Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers	Nov 26, 2019	Speech Synthesis, text-to-speech	Unverified
Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario	May 21, 2020	Attribute, Speech Synthesis	Unverified
Cross-lingual Prosody Transfer for Expressive Machine Dubbing	Jun 20, 2023	Expressive Speech Synthesis, Speech Synthesis	Unverified
Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training	Jan 20, 2022	Multi-Task Learning, Speech Synthesis	Unverified
Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis	Jul 27, 2021	Expressive Speech Synthesis, Speech Synthesis	Unverified
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation	Dec 28, 2024	Speech Synthesis	Unverified
CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis	Feb 28, 2023	Speech Synthesis, text-to-speech	Unverified
Cross-Utterance Conditioned VAE for Speech Generation	Sep 8, 2023	Speech Synthesis, text-to-speech	Unverified
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis	Jun 15, 2021	Speech Synthesis, text-to-speech	Unverified
DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech	Oct 17, 2024	Disentanglement, Quantization	Unverified
DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores	Apr 7, 2022	Self-Supervised Learning, Speech Synthesis	Unverified
Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Nov 10, 2024	Speech Synthesis, text-to-speech	Unverified

Benchmark Results

Results are listed for five datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese and Blizzard Challenge 2013.

LibriTTS (PESQ, higher is better)

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

North American English (MOS, higher is better)

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified

LJSpeech (MOS, higher is better)

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

Mandarin Chinese (MOS, higher is better)

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

Blizzard Challenge 2013 (NLL, lower is better)

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified
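To help read the NLL figures, the following is a minimal sketch of how a per-sample negative log-likelihood is computed for a waveform model that predicts a categorical distribution over quantized amplitude levels, as SampleRNN does. It assumes the NLL is averaged per audio sample and expressed in bits. The function name `nll_bits_per_sample` is hypothetical. Under that assumption, a model that guesses uniformly over 256 levels (8-bit quantization) costs 8 bits per sample, which gives a reference point for the 1.39 reported above.

```python
import math


def nll_bits_per_sample(probs, targets):
    """Average negative log-likelihood, in bits per audio sample.

    probs[t] is the model's predicted distribution over quantization levels
    for sample t, and targets[t] is the index of the level actually observed.
    """
    if not targets or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equal length")
    total = 0.0
    for dist, level in zip(probs, targets):
        p = dist[level]
        if p <= 0.0:
            return math.inf  # the model assigned zero probability to the observed level
        total -= math.log2(p)
    return total / len(targets)


# Baseline: a model that spreads probability uniformly over 256 levels
# pays log2(256) = 8 bits for every sample.
uniform = [1 / 256] * 256
print(nll_bits_per_sample([uniform] * 4, [0, 17, 128, 255]))  # prints 8.0
```

A lower value means the model concentrates more probability on the samples that actually occur. Because the figure is averaged per sample, it is independent of clip length. Because it depends on the quantization scheme, however, it is only comparable between models that use the same one.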