Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 226–250 of 332 papers

Title	Date	Tasks	Status
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis	Aug 4, 2018	Speech Synthesistext-to-speech	—Unverified
Predicting Romanian Stress Assignment	Apr 1, 2014	Speech SynthesisText-To-Speech Synthesis	—Unverified
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior	Jun 11, 2021	Audio GenerationDenoising	—Unverified
Probing Speaker-specific Features in Speaker Representations	Jan 9, 2025	Self-Supervised LearningSpeaker Verification	—Unverified
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Apr 22, 2025	cross-modal alignmentScript Generation	—Unverified
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Jan 10, 2025	Speech Synthesistext-to-speech	—Unverified
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders	Apr 3, 2024	Representation LearningSpeaker Verification	—Unverified
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesistext-to-speech	—Unverified
Prosody-TTS: An end-to-end speech synthesis system with prosody control	Oct 6, 2021	RhythmSpeech Synthesis	—Unverified
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language ModelingLanguage Modelling	—Unverified
Punjabi Text-To-Speech Synthesis System	Dec 1, 2012	Speech Synthesistext-to-speech	—Unverified
Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder	Jul 31, 2018	Generative Adversarial NetworkSpeech Synthesis	—Unverified
Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks	Oct 30, 2018	Image GenerationSpeech Synthesis	—Unverified
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Apr 4, 2024	Language ModelingLanguage Modelling	—Unverified
Real-time Incremental Speech-to-Speech Translation of Dialogs	Jun 1, 2012	Machine TranslationSpeech Recognition	—Unverified
ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	May 9, 2022	Speech Synthesistext-to-speech	—Unverified
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images	Sep 1, 2017	Referring ExpressionReferring expression generation	—Unverified
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability	Apr 3, 2021	Emotion Recognitionreinforcement-learning	—Unverified
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models	May 23, 2024	Image Generationreinforcement-learning	—Unverified
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Dec 21, 2022	Audio-Visual Speech RecognitionResynthesis	—Unverified
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration	Jan 1, 2023	Audio-Visual Speech RecognitionResynthesis	—Unverified
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	May 25, 2025	Speech Synthesistext-to-speech	—Unverified
R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Jun 30, 2022	DecoderGPU	—Unverified
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Jul 2, 2024	Inference OptimizationSpeech Synthesis	—Unverified
RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus	May 1, 2014	Speech Synthesistext-to-speech	—Unverified

Show:10 25 50

← PrevPage 10 of 14Next →

All datasets LJSpeech 20000 utterances CMUDict 0.7b HUI speech corpus Thorsten voice 21.02 neutral Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified