SOTAVerified

Text-To-Speech Synthesis

Text-to-speech (TTS) synthesis is the machine learning task of converting written text into spoken audio. The goal is to generate synthetic speech that is intelligible and sounds as natural as human speech.
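Most TTS systems decompose the task into a text front-end (normalization and grapheme-to-phoneme conversion), an acoustic model, and a vocoder; several papers listed below (e.g. Mlphon, the unified front-end models) target the front-end stage. As a toy illustration of that stage only, a dictionary-based grapheme-to-phoneme lookup might look like the sketch below. The mini-lexicon and the letter-spelling fallback are illustrative assumptions, not taken from any system on this page:

```python
# Toy grapheme-to-phoneme (G2P) front-end sketch.
# The mini-lexicon is illustrative only; real systems use large pronunciation
# dictionaries (e.g. CMUdict) plus a trained model for out-of-vocabulary words.

LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
    "text": ["T", "EH1", "K", "S", "T"],
}

def g2p(sentence: str) -> list[str]:
    """Look up each word; unknown words fall back to spelled-out letters."""
    phonemes = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes

print(g2p("text to speech"))
# → ['T', 'EH1', 'K', 'S', 'T', 'T', 'O', 'S', 'P', 'IY1', 'CH']
```

The phoneme sequence produced here would then be consumed by an acoustic model; "to" is absent from the toy lexicon, so it is spelled out letter by letter to show the fallback path.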

Papers

Showing 101–125 of 332 papers

| Title | Status | Hype |
|---|---|---|
| Mlphon: A Multifunctional Grapheme-Phoneme Conversion Tool Using Finite State Transducers | Code | 0 |
| Independent and automatic evaluation of acoustic-to-articulatory inversion models | Code | 0 |
| Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language | Code | 0 |
| Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input | Code | 0 |
| Creating New Language and Voice Components for the Updated MaryTTS Text-to-Speech Synthesis Platform | — | 0 |
| Controllable Prosody Generation With Partial Inputs | — | 0 |
| A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis | — | 0 |
| Controllable neural text-to-speech synthesis using intuitive prosodic features | — | 0 |
| Controllable Accented Text-to-Speech Synthesis | — | 0 |
| A unified front-end framework for English text-to-speech synthesis | — | 0 |
| An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era | — | 0 |
| Continual Speaker Adaptation for Text-to-Speech Synthesis | — | 0 |
| Full-text Error Correction for Chinese Speech Recognition with Large Language Model | — | 0 |
| FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | — | 0 |
| Conditioning Sequence-to-sequence Networks with Learned Activations | — | 0 |
| A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | — | 0 |
| FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis | — | 0 |
| Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis | — | 0 |
| Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features | — | 0 |
| Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement | — | 0 |
| Generative Pre-training for Speech with Flow Matching | — | 0 |
| Generative Semantic Communication for Text-to-Speech Synthesis | — | 0 |
| Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech | — | 0 |
| Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework | — | 0 |
| A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | — | 0 |
Page 5 of 14

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NaturalSpeech | Audio Quality MOS | 4.56 | — | Unverified |
| 2 | VITS | Audio Quality MOS | 4.43 | — | Unverified |
| 3 | Grad-TTS + HiFiGAN (1000 steps) | Audio Quality MOS | 4.37 | — | Unverified |
| 4 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.34 | — | Unverified |
| 5 | Glow-TTS + HiFiGAN | Audio Quality MOS | 4.34 | — | Unverified |
| 6 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.32 | — | Unverified |
| 7 | FastDiff (4 steps) | Audio Quality MOS | 4.28 | — | Unverified |
| 8 | FastDiff-TTS | Audio Quality MOS | 4.03 | — | Unverified |
| 9 | Transformer TTS (Mel + WaveGlow) | Audio Quality MOS | 3.88 | — | Unverified |
| 10 | FastSpeech (Mel + WaveGlow) | Audio Quality MOS | 3.84 | — | Unverified |
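The MOS figures above are Mean Opinion Scores: averages of listener ratings of naturalness on a 1–5 scale, conventionally reported with a 95% confidence interval. A minimal computation is sketched below; the eight ratings are made up for illustration and use a normal approximation for the interval:

```python
import statistics

def mos(ratings: list[float]) -> tuple[float, float]:
    """Mean Opinion Score and 95% confidence half-width (normal approximation)."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / n ** 0.5
    return mean, half_width

# Hypothetical naturalness ratings (1-5) from 8 listeners for one utterance.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
m, ci = mos(ratings)
print(f"MOS = {m:.2f} ± {ci:.2f}")  # → MOS = 4.25 ± 0.49
```

In published evaluations the mean is taken over many utterances and many listeners, and with small rating panels a t-distribution interval is usually preferred over the 1.96 normal factor used here.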
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mia | 10-keyword Speech Commands dataset | 16 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Token-Level Ensemble Distillation | Phoneme Error Rate | 4.6 | — | Unverified |
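Phoneme Error Rate, the metric in the row above, is the Levenshtein edit distance (substitutions, insertions, deletions) between the predicted and reference phoneme sequences, normalized by the reference length. A standard dynamic-programming implementation, with an invented two-sequence example:

```python
def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Edit distance over phoneme tokens, divided by the reference length."""
    # d[j] holds the edit distance between ref[:i] and hyp[:j] for the current row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                d[j] + 1,              # deletion
                d[j - 1] + 1,          # insertion
                prev_diag + (r != h),  # substitution (or match if equal)
            )
            prev_diag, d[j] = d[j], cur
    return d[-1] / len(ref)

# Hypothetical prediction with one substituted phoneme: 1 error / 4 = 0.25.
ref = ["S", "P", "IY1", "CH"]
hyp = ["S", "P", "IY0", "CH"]
print(phoneme_error_rate(ref, hyp))  # → 0.25
```

Multiplying by 100 gives the percentage form used in leaderboards, so a PER of 4.6 means roughly one phoneme error per 22 reference phonemes.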
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tacotron 2 | Mean Opinion Score | 3.74 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tacotron 2 | Mean Opinion Score | 3.49 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Match-TTS | GMOS | 3.7 | — | Unverified |