Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that converts written text into spoken audio. The goal is to generate synthetic speech that sounds as natural and as close to human speech as possible.

Papers


Showing 201–250 of 332 papers

Title	Date	Tasks	Status
Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios	Dec 23, 2021	Diversity, Speech Synthesis	Unverified
Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes	Aug 7, 2020	Gaussian Processes, Speech Synthesis	Unverified
Multi-Stage Deep Transfer Learning for EmIoT-enabled Human-Computer Interaction	Feb 3, 2022	Human-Object Interaction Detection, text-to-speech	Unverified
Multi-step Natural Language Understanding	Aug 1, 2013	Natural Language Understanding, Speech Recognition	Unverified
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Nov 17, 2022	Speech Synthesis, text-to-speech	Unverified
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era	Oct 6, 2022	Speech Synthesis, text-to-speech	Unverified
Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis	Aug 27, 2019	Speech Synthesis, text-to-speech	Unverified
Neural Models of Text Normalization for Speech Applications	Jun 1, 2019	BIG-bench Machine Learning, Speech Synthesis	Unverified
Neural Speech Synthesis in German	Oct 3, 2021	Speech Synthesis, text-to-speech	Unverified
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Jun 4, 2025	Data Augmentation, Diversity	Unverified
Neural Text Normalization with Subword Units	Jun 1, 2019	Machine Translation, Natural Language Understanding	Unverified
Neural Text-to-Speech Synthesis for an Under-Resourced Language in a Diglossic Environment: the Case of Gascon Occitan	May 1, 2020	Speech Synthesis, text-to-speech	Unverified
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters	Jan 10, 2024	Self-Supervised Learning, Speech Enhancement	Unverified
Normalization of Lithuanian Text Using Regular Expressions	Dec 29, 2023	Speech Synthesis, Text Normalization	Unverified
Normalization of Non-Standard Words in Croatian Texts	Mar 27, 2015	Form, General Classification	Unverified
Normalizing Text using Language Modelling based on Phonetics and String Similarity	Jun 25, 2020	Language Modeling, Language Modelling	Unverified
Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing	May 1, 2012	Chunking, Descriptive	Unverified
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Apr 2, 2022	Speech Synthesis, text-to-speech	Unverified
An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis	Jun 3, 2021	Speaker Verification, Speech Synthesis	Unverified
An In-depth Analysis of the Effect of Text Normalization in Social Media	May 1, 2015	Dependency Parsing, named-entity-recognition	Unverified
Parallel WaveNet conditioned on VAE latent vectors	Dec 17, 2020	Sentence, Speech Synthesis	Unverified
ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations	Mar 1, 2023	Self-Supervised Learning, Speech Synthesis	Unverified
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis	Jun 4, 2024	In-Context Learning, Language Modeling	Unverified
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS	Mar 28, 2021	Representation Learning, Text-To-Speech Synthesis	Unverified
An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis	Dec 8, 2023	Benchmarking, Quantization	Unverified
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis	Aug 4, 2018	Speech Synthesis, text-to-speech	Unverified
Predicting Romanian Stress Assignment	Apr 1, 2014	Speech Synthesis, Text-To-Speech Synthesis	Unverified
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior	Jun 11, 2021	Audio Generation, Denoising	Unverified
Probing Speaker-specific Features in Speaker Representations	Jan 9, 2025	Self-Supervised Learning, Speaker Verification	Unverified
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Apr 22, 2025	cross-modal alignment, Script Generation	Unverified
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Jan 10, 2025	Speech Synthesis, text-to-speech	Unverified
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders	Apr 3, 2024	Representation Learning, Speaker Verification	Unverified
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesis, text-to-speech	Unverified
Prosody-TTS: An end-to-end speech synthesis system with prosody control	Oct 6, 2021	Rhythm, Speech Synthesis	Unverified
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language Modeling, Language Modelling	Unverified
Punjabi Text-To-Speech Synthesis System	Dec 1, 2012	Speech Synthesis, text-to-speech	Unverified
Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder	Jul 31, 2018	Generative Adversarial Network, Speech Synthesis	Unverified
Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks	Oct 30, 2018	Image Generation, Speech Synthesis	Unverified
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Apr 4, 2024	Language Modeling, Language Modelling	Unverified
Real-time Incremental Speech-to-Speech Translation of Dialogs	Jun 1, 2012	Machine Translation, Speech Recognition	Unverified
ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	May 9, 2022	Speech Synthesis, text-to-speech	Unverified
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images	Sep 1, 2017	Referring Expression, Referring expression generation	Unverified
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability	Apr 3, 2021	Emotion Recognition, reinforcement-learning	Unverified
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models	May 23, 2024	Image Generation, reinforcement-learning	Unverified
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Dec 21, 2022	Audio-Visual Speech Recognition, Resynthesis	Unverified
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration	Jan 1, 2023	Audio-Visual Speech Recognition, Resynthesis	Unverified
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	May 25, 2025	Speech Synthesis, text-to-speech	Unverified
R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Jun 30, 2022	Decoder, GPU	Unverified
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Jul 2, 2024	Inference Optimization, Speech Synthesis	Unverified
RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus	May 1, 2014	Speech Synthesis, text-to-speech	Unverified


Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified