Text-To-Speech Synthesis

Text-To-Speech Synthesis is the machine learning task of converting written text into spoken words. The goal is synthetic speech that sounds as natural and as close to human speech as possible.

Papers

Showing 1–50 of 332 papers

Title	Date	Tasks	Status	Hype
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	Jan 5, 2023	In-Context Learning, Language Modeling	Code Available	7
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech	Nov 7, 2022	Representation Learning, Speech Representation Learning	Code Available	6
PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit	May 20, 2022	All, Automatic Speech Recognition (ASR)	Code Available	6
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Jun 5, 2024	Automatic Speech Recognition (ASR), de-en	Code Available	5
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Mar 7, 2023	In-Context Learning, Language Modeling	Code Available	5
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Jun 16, 2025	Decoder, Speech Synthesis	Code Available	4
Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert	Apr 18, 2023	Audio Generation, Expressive Speech Synthesis	Code Available	4
MoonCast: High-Quality Zero-Shot Podcast Generation	Mar 18, 2025	Speech Synthesis, text-to-speech	Code Available	3
Matcha-TTS: A fast TTS architecture with conditional flow matching	Sep 6, 2023	Acoustic Modelling, Decoder	Code Available	3
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech	Jul 13, 2022	Denoising, GPU	Code Available	3
Efficient Neural Audio Synthesis	Feb 23, 2018	Audio Synthesis, CPU	Code Available	2
StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis	May 30, 2022	Data Augmentation, Self-Supervised Learning	Code Available	2
FastSpeech: Fast, Robust and Controllable Text-to-Speech	May 22, 2019	Decoder, text-to-speech	Code Available	2
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech	Jun 12, 2024	Emotional Speech Synthesis, text-to-speech	Code Available	2
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech	Feb 8, 2023	Code Generation, Diversity	Code Available	2
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram	Oct 25, 2019	Generative Adversarial Network, GPU	Code Available	2
Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Oct 30, 2024	Speech Synthesis, text-to-speech	Code Available	2
FastSpeech: Fast, Robust and Controllable Text to Speech	May 22, 2019	Decoder, Speech Synthesis	Code Available	2
Towards Building Text-To-Speech Systems for the Next Billion Users	Nov 17, 2022	Diversity, Speech Synthesis	Code Available	2
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Oct 7, 2023	Audio captioning, Automatic Speech Recognition	Code Available	2
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows	Mar 3, 2022	Speech Synthesis, text-to-speech	Code Available	2
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform	Mar 4, 2022	Speech Synthesis, text-to-speech	Code Available	2
Neural Speech Synthesis with Transformer Network	Sep 19, 2018	Decoder, Machine Translation	Code Available	2
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	Sep 14, 2023	Automatic Speech Recognition, speech-recognition	Code Available	2
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech	May 15, 2022	Speech Synthesis, Style Transfer	Code Available	2
Sample-Efficient Diffusion for Text-To-Speech Synthesis	Sep 1, 2024	Language Modeling, Language Modelling	Code Available	2
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Sep 11, 2024	Decoder, Speech Synthesis	Code Available	2
PortaSpeech: Portable and High-Quality Generative Text-to-Speech	Sep 30, 2021	text-to-speech, Text to Speech	Code Available	2
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	May 9, 2022	Sentence, Speech Synthesis	Code Available	2
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Mar 31, 2024	Denoising, Speech Synthesis	Code Available	2
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism	May 6, 2021	Generative Adversarial Network, Singing Voice Synthesis	Code Available	2
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis	Apr 21, 2022	Denoising, GPU	Code Available	2
Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling	Oct 14, 2023	Speech Synthesis, text-to-speech	Code Available	2
Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus	Dec 20, 2021	Audio Generation, Singing Voice Synthesis	Code Available	1
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration	May 25, 2023	Speech Synthesis, text-to-speech	Code Available	1
MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline	Sep 22, 2022	Speech Synthesis, text-to-speech	Code Available	1
MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset	Dec 11, 2022	Speech Synthesis, text-to-speech	Code Available	1
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	May 21, 2025	Bayesian Optimization, Speech Synthesis	Code Available	1
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation	Aug 3, 2023	Decoder, Quantization	Code Available	1
Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech	Nov 24, 2023	Dimensionality Reduction, Emotion Classification	Code Available	1
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset	Apr 17, 2021	Speech Synthesis, text-to-speech	Code Available	1
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis	Apr 1, 2024	Speech Synthesis, text-to-speech	Code Available	1
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts	Apr 29, 2024	Contrastive Learning, Speech Synthesis	Code Available	1
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search	May 22, 2020	text-to-speech, Text to Speech	Code Available	1
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech	May 13, 2021	Decoder, Speech Synthesis	Code Available	1
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech	Feb 27, 2023	Speech Synthesis, text-to-speech	Code Available	1
Fine-grained style control in Transformer-based Text-to-speech Synthesis	Oct 12, 2021	Inductive Bias, Speech Synthesis	Code Available	1
ArTST: Arabic Text and Speech Transformer	Oct 25, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech	Jun 8, 2020	Knowledge Distillation, Speech Synthesis	Code Available	1
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis	May 12, 2020	Speech Synthesis, Style Transfer	Code Available	1

Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified
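
Most of the results above are reported as a Mean Opinion Score (MOS), which is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale, often quoted with a confidence interval. The following is a minimal sketch of that arithmetic, not the evaluation protocol of any listed paper. The function name `mean_opinion_score`, the normal-approximation interval, and the ratings themselves are all assumptions made for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings are on the usual 1-5 scale and uses a normal
    approximation for the interval (an illustrative choice).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Half-width = z * sample standard deviation / sqrt(number of ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, not taken from any paper listed here.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # prints "MOS = 4.25 +/- 0.49"
```

With only a handful of ratings the interval is wide, which is one reason small MOS differences between models, such as 4.34 versus 4.32, are hard to interpret without knowing the size of the listening test.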