Speech Synthesis

Speech synthesis is the task of generating speech from some other modality, such as text or lip movements.

Please note that the leaderboards here are not really comparable between studies, as they use mean opinion score as a metric and collect different samples of listener ratings from Amazon Mechanical Turk.
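As a rough illustration of why such scores are hard to compare, a mean opinion score is simply the average of listener ratings on a 1 to 5 scale, so it depends on which listeners and samples happened to be drawn. The sketch below is an assumption-laden example, not taken from any listed paper: the `mean_opinion_score` helper and both sets of ratings are hypothetical, and the interval uses a plain normal approximation.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings of the same system from two different listener pools.
study_a = [5, 4, 5, 4, 4, 5, 4, 5]
study_b = [4, 3, 4, 4, 3, 4, 3, 4]

mos_a, ci_a = mean_opinion_score(study_a)
mos_b, ci_b = mean_opinion_score(study_b)
print(f"Study A: MOS {mos_a:.2f} +/- {ci_a:.2f}")
print(f"Study B: MOS {mos_b:.2f} +/- {ci_b:.2f}")
```

With only a handful of raters per study, the two means can differ by more than either interval, which is the reason a MOS reported in one paper says little about a MOS reported in another.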

(Image credit: WaveNet: A Generative Model for Raw Audio)

Papers

Showing 651–700 of 1249 papers

Title	Date	Tasks	Status
Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis	Jun 5, 2023	Rhythm, Sentence	Unverified
R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Jun 30, 2022	Decoder, GPU	Unverified
Robotic Speech Synthesis: Perspectives on Interactions, Scenarios, and Ethics	Mar 17, 2022	Ethics, Speech Synthesis	Unverified
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations	Jul 3, 2023	Lip to Speech Synthesis, Speaker-Specific Lip to Speech Synthesis	Unverified
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Jul 2, 2024	Inference Optimization, Speech Synthesis	Unverified
RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations	May 19, 2025	Speaker Verification, Speech Enhancement	Unverified
RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus	May 1, 2014	Speech Synthesis, text-to-speech	Unverified
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis	Jun 26, 2019	Speech Synthesis, text-to-speech	Unverified
Russian Stress Prediction using Maximum Entropy Ranking	Oct 1, 2013	Machine Translation, Prediction	Unverified
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading Comprehension, Speech Synthesis	Unverified
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction	Jun 2, 2025	Speech Synthesis, text-to-speech	Unverified
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Aug 2, 2023	Decoder, Self-Supervised Learning	Unverified
Sampling-based speech parameter generation using moment-matching networks	Apr 12, 2017	Speech Synthesis	Unverified
Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems	Jul 31, 2018	Speech Synthesis	Unverified
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis	Dec 6, 2023	Speech Synthesis, text-to-speech	Unverified
Securing Voice-driven Interfaces against Fake (Cloned) Audio Attacks	Feb 18, 2019	Speech Synthesis, Voice Cloning	Unverified
Seeing Voices: Generating A-Roll Video from Audio with Mirage	Jun 9, 2025	Speech Synthesis, text-to-speech	Unverified
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection	Aug 30, 2024	Self-Supervised Learning, Speech Synthesis	Unverified
Self-Attention Linguistic-Acoustic Decoder	Aug 31, 2018	CPU, Decoder	Unverified
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis	Jun 25, 2022	Contrastive Learning, Deep Clustering	Unverified
Self-supervised learning for robust voice cloning	Apr 7, 2022	Self-Supervised Learning, Speech Synthesis	Unverified
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations	Oct 14, 2023	Self-Supervised Learning, Speaker Verification	Unverified
Semantics and Discourse Processing for Expressive TTS	Sep 1, 2015	Speech Synthesis	Unverified
SE-MelGAN -- Speaker Agnostic Rapid Speech Enhancement	Jun 13, 2020	CPU, GPU	Unverified
Semi-Supervised Generative Modeling for Controllable Speech Synthesis	Oct 3, 2019	Speech Synthesis, text-to-speech	Unverified
Semi-Supervised Learning Based on Reference Model for Low-resource TTS	Oct 25, 2022	Speech Synthesis, text-to-speech	Unverified
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation	May 16, 2020	Decoder, Speech Synthesis	Unverified
Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis	Aug 30, 2018	Decoder, Speech Synthesis	Unverified
Sentiment Analysis for Emotional Speech Synthesis in a News Dialogue System	Dec 1, 2020	Articles, Emotional Speech Synthesis	Unverified
Sequence Modeling using Gated Recurrent Neural Networks	Jan 1, 2015	Machine Translation, Speech Synthesis	Unverified
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis	May 18, 2025	Speech Synthesis, text-to-speech	Unverified
Significance of Maximum Spectral Amplitude in Sub-bands for Spectral Envelope Estimation and Its Application to Statistical Parametric Speech Synthesis	Aug 3, 2015	Speech Synthesis	Unverified
Silent Speech Interfaces for Speech Restoration: A Review	Sep 4, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Simple and Effective Unsupervised Speech Synthesis	Apr 6, 2022	speech-recognition, Speech Recognition	Unverified
Simple and Effective Unsupervised Speech Translation	Oct 18, 2022	Domain Adaptation, Machine Translation	Unverified
Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS	Nov 10, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Simultaneous Translation	Nov 1, 2020	Machine Translation, speech-recognition	Unverified
SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation	Oct 14, 2021	Generative Adversarial Network, GPU	Unverified
Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation	Sep 17, 2024	Knowledge Distillation, Speech Synthesis	Unverified
Situated Incremental Natural Language Understanding using a Multimodal, Linguistically-driven Update Model	Aug 1, 2014	Dialogue Management, Natural Language Understanding	Unverified
SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow	Apr 10, 2025	Speech Synthesis, text-to-speech	Unverified
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Jul 18, 2023	Generative Adversarial Network, Language Modeling	Unverified
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech	Nov 30, 2022	Speech Synthesis, text-to-speech	Unverified
SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation	Apr 21, 2025	parameter-efficient fine-tuning, Speech Synthesis	Unverified
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis	Apr 6, 2022	Speech Synthesis, text-to-speech	Unverified
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation	Jul 27, 2022	Language Modeling, Language Modelling	Unverified
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis	Feb 26, 2025	Speech Synthesis, text-to-speech	Unverified
Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis	Mar 2, 2022	Speech Synthesis	Unverified
Speaker-adaptive neural vocoders for parametric speech synthesis systems	Nov 8, 2018	Speech Synthesis, text-to-speech	Unverified
Speaker Anonymization Using X-vector and Neural Waveform Models	May 30, 2019	Speaker anonymization, Speaker Verification	Unverified

Datasets: LibriTTS, North American English, LJSpeech, Mandarin Chinese, Blizzard Challenge 2013

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	PeriodWave-Turbo-L	PESQ	4.45	—	Unverified
2	BigVGAN-v2	PESQ	4.36	—	Unverified
3	EVA-GAN-big	PESQ	4.35	—	Unverified
4	PeriodWave + FreeU	PESQ	4.25	—	Unverified
5	RFWave	PESQ	4.23	—	Unverified
6	BigVSAN (w/ snakebeta)	PESQ	4.12	—	Unverified
7	BigVSAN	PESQ	4.12	—	Unverified
8	EVA-GAN-base	PESQ	4.03	—	Unverified
9	BigVGAN	PESQ	4.03	—	Unverified
10	Vocos	PESQ	3.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	4.53	—	Unverified
2	WaveNet (Linguistic)	Mean Opinion Score	4.34	—	Unverified
3	WaveNet (L+F)	Mean Opinion Score	4.21	—	Unverified
4	Tacotron	Mean Opinion Score	4	—	Unverified
5	HMM-driven concatenative	Mean Opinion Score	3.86	—	Unverified
6	LSTM-RNN parametric	Mean Opinion Score	3.67	—	Unverified
7	means	Mean Opinion Score	0	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BDDM vocoder	Mean Opinion Score	4.48	—	Unverified
2	DiffWave LARGE	Mean Opinion Score	4.44	—	Unverified
3	Neural HMM	Mean Opinion Score	3.24	—	Unverified
4	Neural HMM Ablation with 1 state per phone	Mean Opinion Score	2.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	WaveNet (L+F)	Mean Opinion Score	4.08	—	Unverified
2	LSTM-RNN parametric	Mean Opinion Score	3.79	—	Unverified
3	HMM-driven concatenative	Mean Opinion Score	3.47	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SampleRNN (2-tier)	NLL	1.39	—	Unverified
2	SampleRNN (3-tier)	NLL	1.39	—	Unverified