Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–225 of 332 papers

Title	Date	Tasks	Status
Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis	Oct 9, 2021	Lifelong learningSpeech Synthesis	CodeCode Available
Environment Aware Text-to-Speech Synthesis	Oct 8, 2021	AttributeDisentanglement	—Unverified
Prosody-TTS: An end-to-end speech synthesis system with prosody control	Oct 6, 2021	RhythmSpeech Synthesis	—Unverified
Neural Speech Synthesis in German	Oct 3, 2021	Speech Synthesistext-to-speech	—Unverified
Guided-TTS:Text-to-Speech with Untranscribed Speech	Sep 29, 2021	Speech Synthesistext-to-speech	—Unverified
Conditioning Sequence-to-sequence Networks with Learned Activations	Sep 29, 2021	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network	Sep 22, 2021	Knowledge DistillationLanguage Modeling	—Unverified
A Unified Transformer-based Framework for Duplex Text Normalization	Aug 23, 2021	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging	Jul 12, 2021	PredictionSpeech Synthesis	CodeCode Available
Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm	Jul 6, 2021	Speech Synthesistext-to-speech	—Unverified
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input	Jul 5, 2021	Speech Synthesistext-to-speech	CodeCode Available
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior	Jun 11, 2021	Audio GenerationDenoising	—Unverified
An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis	Jun 3, 2021	Speaker VerificationSpeech Synthesis	—Unverified
Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis	Jun 3, 2021	Data AugmentationSpeaker Verification	—Unverified
Dual Script E2E framework for Multilingual and Code-Switching ASR	Jun 2, 2021	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis	Apr 26, 2021	Language ModelingLanguage Modelling	CodeCode Available
Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis	Apr 14, 2021	Dependency ParsingRepresentation Learning	—Unverified
Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features	Apr 8, 2021	DecoderSpeech Synthesis	—Unverified
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability	Apr 3, 2021	Emotion Recognitionreinforcement-learning	—Unverified
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS	Mar 28, 2021	Representation LearningText-To-Speech Synthesis	—Unverified
Continual Speaker Adaptation for Text-to-Speech Synthesis	Mar 26, 2021	Continual LearningDiversity	—Unverified
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input	Feb 19, 2021	Language ModelingLanguage Modelling	—Unverified
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention	Feb 12, 2021	Speech Synthesistext-to-speech	—Unverified
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning	Feb 10, 2021	Speech Synthesistext-to-speech	—Unverified
Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet	Jan 30, 2021	CPUSentence	—Unverified

Show:10 25 50

← PrevPage 9 of 14Next →

All datasets LJSpeech 20000 utterances CMUDict 0.7b HUI speech corpus Thorsten voice 21.02 neutral Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified