Speech Synthesis

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable between studies, because they use mean opinion score (MOS) as a metric and each study collects its ratings from a different sample of listeners on Amazon Mechanical Turk.
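
To illustrate why MOS values from separate studies are hard to compare, here is a minimal sketch of how a MOS and a rough confidence interval might be computed from listener ratings. The function name `mean_opinion_score`, the 1 to 5 rating scale, the normal-approximation interval, and both rating lists are assumptions made for illustration; none of them come from the studies listed on this page.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of listener ratings.

    half_width is a normal-approximation confidence half-interval
    (z=1.96 corresponds to roughly 95%). This is an illustrative
    choice, not a procedure prescribed by any listed paper.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical ratings of the same system from two separate listener pools.
pool_a = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]
pool_b = [4, 3, 4, 4, 3, 5, 3, 4, 4, 3]

for name, pool in (("pool A", pool_a), ("pool B", pool_b)):
    mos, half_width = mean_opinion_score(pool)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

In this made-up example the same system would receive a different MOS from each pool, which is the kind of variation that makes scores from different listener samples difficult to rank against each other.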

(Image credit: WaveNet: A generative model for raw audio)

Papers


Showing 351–400 of 1249 papers

Title	Date	Tasks	Status	Hype
iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN	Aug 14, 2023	Speech Synthesis	Unverified	0
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis	Aug 10, 2023	Resynthesis, Speech Synthesis	Unverified	0
On Error Propagation of Diffusion Models	Aug 9, 2023	Denoising, Image Generation	Unverified	0
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS	Aug 3, 2023	Denoising, Speech Synthesis	Unverified	0
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation	Aug 3, 2023	Decoder, Quantization	Code Available	1
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Aug 2, 2023	Decoder, Self-Supervised Learning	Unverified	0
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Jul 31, 2023	Acoustic Modelling, Speech Synthesis	Unverified	0
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training	Jul 31, 2023	Denoising, Expressive Speech Synthesis	Code Available	1
Audio-visual video-to-speech synthesis with synthesized input audio	Jul 31, 2023	Speech Synthesis	Unverified	0
METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer	Jul 29, 2023	Disentanglement, Diversity	Unverified	0
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding	Jul 28, 2023	Language Modeling, Language Modelling	Unverified	0
SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer	Jul 20, 2023	Expressive Speech Synthesis, Language Modelling	Code Available	1
An analysis on the effects of speaker embedding choice in non auto-regressive TTS	Jul 19, 2023	Representation Learning, Speech Synthesis	Unverified	0
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Jul 18, 2023	Generative Adversarial Network, Language Modeling	Unverified	0
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis	Jul 14, 2023	In-Context Learning, Language Modelling	Unverified	0
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis	Jul 11, 2023	Prediction, Self-Supervised Learning	Unverified	0
Deep Speech Synthesis from MRI-Based Articulatory Representations	Jul 5, 2023	Computational Efficiency, Denoising	Code Available	1
Disentanglement in a GAN for Unconditional Speech Synthesis	Jul 4, 2023	Disentanglement, Generative Adversarial Network	Code Available	1
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations	Jul 3, 2023	Lip to Speech Synthesis, Speaker-Specific Lip to Speech Synthesis	Unverified	0
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Jun 29, 2023	Speech Synthesis, text-to-speech	Unverified	0
EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech	Jun 28, 2023	Emotion Recognition, Speech Synthesis	Code Available	1
Large-scale unsupervised audio pre-training for video-to-speech synthesis	Jun 27, 2023	speech-recognition, Speech Recognition	Unverified	0
DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech	Jun 25, 2023	Speech Synthesis, text-to-speech	Unverified	0
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale	Jun 23, 2023	In-Context Learning, Speech Synthesis	Code Available	0
Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection	Jun 21, 2023	Automatic Speech Recognition, speech-recognition	Unverified	0
Visual-Aware Text-to-Speech	Jun 21, 2023	Rhythm, Speech Synthesis	Unverified	0
Cross-lingual Prosody Transfer for Expressive Machine Dubbing	Jun 20, 2023	Expressive Speech Synthesis, Speech Synthesis	Unverified	0
CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages	Jun 16, 2023	Speech Synthesis, text-to-speech	Unverified	0
Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody	Jun 16, 2023	Speech Synthesis	Unverified	0
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis	Jun 15, 2023	Denoising, Speech Synthesis	Unverified	0
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models	Jun 13, 2023	Speech Synthesis, text-to-speech	Code Available	5
PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling	Jun 13, 2023	Language Modeling, Language Modelling	Unverified	0
HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models	Jun 12, 2023	Denoising, Singing Voice Synthesis	Unverified	0
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion	Jun 9, 2023	Denoising, Speech Synthesis	Unverified	0
VIFS: An End-to-End Variational Inference for Foley Sound Synthesis	Jun 8, 2023	Speech Synthesis, text-to-speech	Code Available	0
Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text	Jun 6, 2023	Speech Synthesis	Code Available	0
PolyVoice: Language Models for Speech to Speech Translation	Jun 5, 2023	Language Modeling, Language Modelling	Unverified	0
Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis	Jun 5, 2023	Rhythm, Sentence	Unverified	0
Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously	Jun 3, 2023	Speech Synthesis	Code Available	0
Speaker-independent neural formant synthesis	Jun 2, 2023	Speech Synthesis	Unverified	0
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis	Jun 1, 2023	Audio Synthesis, Computational Efficiency	Code Available	4
Speech inpainting: Context-based speech synthesis guided by video	Jun 1, 2023	speech-recognition, Speech Recognition	Unverified	0
Text-to-Speech Pipeline for Swiss German -- A comparison	May 31, 2023	Speech Synthesis, text-to-speech	Unverified	0
Intelligible Lip-to-Speech Synthesis with Speech Units	May 31, 2023	Lip to Speech Synthesis, Speech Synthesis	Code Available	1
Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis	May 29, 2023	Speech Synthesis, text-to-speech	Unverified	0
ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation	May 29, 2023	Speech Synthesis, text-to-speech	Code Available	1
Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models	May 27, 2023	Speech Synthesis, Voice Conversion	Unverified	0
Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis	May 26, 2023	Decoder, Speech Synthesis	Code Available	1
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration	May 25, 2023	Speech Synthesis, text-to-speech	Code Available	1
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM	May 24, 2023	Language Modelling, Question Answering	Code Available	0


Datasets (the benchmark tables below follow this order): LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified