Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 101–125 of 332 papers

Title	Date	Tasks	Status
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms	Nov 9, 2018	GPUImage Captioning	—Unverified
EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models	Sep 22, 2022	Speech Synthesistext-to-speech	—Unverified
Exploring Transfer Learning for Urdu Speech Synthesis	Jun 1, 2022	Speech Synthesistext-to-speech	—Unverified
CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface	Apr 1, 2017	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Fast Bootstrapping of Grapheme to Phoneme System for Under-resourced Languages - Application to the Iban Language	Oct 1, 2013	Speech RecognitionSpeech Synthesis	—Unverified
Environment Aware Text-to-Speech Synthesis	Oct 8, 2021	AttributeDisentanglement	—Unverified
Building Text-to-Speech Systems for Resource Poor Languages	May 1, 2012	ClusteringSpeech Synthesis	—Unverified
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback	Jun 2, 2024	Speech Synthesistext-to-speech	—Unverified
Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis	May 1, 2012	Audio-Visual Speech RecognitionSpeech Recognition	—Unverified
An In-depth Analysis of the Effect of Text Normalization in Social Media	May 1, 2015	Dependency Parsingnamed-entity-recognition	—Unverified
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement	Nov 8, 2020	DisentanglementSpeech Synthesis	—Unverified
Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features	Apr 8, 2021	DecoderSpeech Synthesis	—Unverified
Adaptive Parser-Centric Text Normalization	Aug 1, 2013	Machine TranslationSpeech Recognition	—Unverified
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis	Jun 30, 2024	CPUDecoder	—Unverified
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	May 20, 2025	Dataset GenerationSpeech Synthesis	—Unverified
Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Sep 12, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance	Nov 23, 2021	speech-recognitionSpeech Recognition	—Unverified
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis	Mar 14, 2019	Generative Adversarial NetworkSpeech Synthesis	—Unverified
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Oct 6, 2024	Language ModelingLanguage Modelling	—Unverified
End-to-End Text-to-Speech using Latent Duration based on VQ-VAE	Oct 19, 2020	Speech Synthesistext-to-speech	—Unverified
Generative Pre-training for Speech with Flow Matching	Oct 25, 2023	Speech EnhancementSpeech Synthesis	—Unverified
Generative Semantic Communication for Text-to-Speech Synthesis	Oct 4, 2024	QuantizationSemantic Communication	—Unverified
BUCEADOR, a multi-language search engine for digital libraries	May 1, 2012	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator	Oct 31, 2018	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Boosting Large Language Model for Speech Synthesis: An Empirical Study	Dec 30, 2023	Language ModelingLanguage Modelling	—Unverified

Show:10 25 50

← PrevPage 5 of 14Next →

All datasets LJSpeech 20000 utterances CMUDict 0.7b HUI speech corpus Thorsten voice 21.02 neutral Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified