SOTAVerified

Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that converts written text into spoken audio. The goal is to generate synthetic speech that is intelligible and sounds as close to natural human speech as possible.
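Before any acoustic modeling, a typical TTS front end first normalizes written text into a speakable form (expanding digits, abbreviations, and so on) — several papers listed below address exactly this step. A minimal, purely illustrative regex-based sketch (real normalizers use much richer grammars; the rules here are hypothetical examples):

```python
import re

# Hypothetical toy rules for illustration only.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand abbreviations and single digits into speakable words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Expand lone digits; real systems handle full numbers, dates, currency.
    text = re.sub(r"\d", lambda m: " " + ONES[int(m.group())] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Dr. Smith lives at 4 Main St."))
```

The normalized string, rather than the raw text, is what gets mapped to phonemes and then to audio by the models listed below.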

Papers

Showing 201–225 of 332 papers

Title | Status | Hype
Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer | | 0
Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios | | 0
Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes | | 0
Multi-Stage Deep Transfer Learning for EmIoT-enabled Human-Computer Interaction | | 0
Multi-step Natural Language Understanding | | 0
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models | | 0
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era | | 0
Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis | | 0
Neural Models of Text Normalization for Speech Applications | | 0
Neural Speech Synthesis in German | | 0
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | | 0
Neural Text Normalization with Subword Units | | 0
Neural Text-to-Speech Synthesis for an Under-Resourced Language in a Diglossic Environment: the Case of Gascon Occitan | | 0
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters | | 0
Normalization of Lithuanian Text Using Regular Expressions | | 0
Normalization of Non-Standard Words in Croatian Texts | | 0
Normalizing Text using Language Modelling based on Phonetics and String Similarity | | 0
Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing | | 0
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature | | 0
An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis | | 0
An In-depth Analysis of the Effect of Text Normalization in Social Media | | 0
Parallel WaveNet conditioned on VAE latent vectors | | 0
ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations | | 0
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis | | 0
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS | | 0
Page 9 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | NaturalSpeech | Audio Quality MOS | 4.56 | | Unverified
2 | VITS | Audio Quality MOS | 4.43 | | Unverified
3 | Grad-TTS + HiFiGAN (1000 steps) | Audio Quality MOS | 4.37 | | Unverified
4 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified
5 | Glow-TTS + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified
6 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.32 | | Unverified
7 | FastDiff (4 steps) | Audio Quality MOS | 4.28 | | Unverified
8 | FastDiff-TTS | Audio Quality MOS | 4.03 | | Unverified
9 | Transformer TTS (Mel + WaveGlow) | Audio Quality MOS | 3.88 | | Unverified
10 | FastSpeech (Mel + WaveGlow) | Audio Quality MOS | 3.84 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Mia | 10-keyword Speech Commands dataset | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Token-Level Ensemble Distillation | Phoneme Error Rate | 4.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Tacotron 2 | Mean Opinion Score | 3.74 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Tacotron 2 | Mean Opinion Score | 3.49 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Match-TTS | GMOS | 3.7 | | Unverified
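The Mean Opinion Score (MOS) values above are averages of human listener ratings, typically on a 1–5 naturalness scale. A minimal sketch of how a MOS and a normal-approximation 95% confidence interval are computed from raw ratings (the sample ratings here are made up for illustration):

```python
import statistics

def mos(ratings):
    """Mean Opinion Score of 1-5 listener ratings, with a
    normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    # Sample standard deviation / sqrt(n), scaled by z = 1.96 for 95% CI.
    ci = 1.96 * statistics.stdev(ratings) / n ** 0.5 if n > 1 else 0.0
    return round(mean, 2), round(ci, 2)

scores = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]  # hypothetical listener ratings
m, ci = mos(scores)
print(f"MOS = {m} ± {ci}")  # MOS = 4.3 ± 0.42
```

Because MOS depends on the listener pool and test conditions, scores from different papers (the "Claimed" column) are not directly comparable until re-evaluated under a common protocol, which is what the "Verified" column is meant to capture.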