Text-To-Speech Synthesis

Text-To-Speech Synthesis is the machine learning task of converting written text into spoken words. The goal is to generate synthetic speech that resembles natural human speech as closely as possible.

Papers

Showing 151–200 of 332 papers

Title	Date	Tasks	Status
Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance	Nov 23, 2021	speech-recognition, Speech Recognition	Unverified
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Oct 6, 2024	Language Modeling, Language Modelling	Unverified
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis	Sep 17, 2020	Expressive Speech Synthesis, Speech Synthesis	Unverified
Hierarchical Representation of Prosody for Statistical Speech Synthesis	Oct 7, 2015	Speech Synthesis, text-to-speech	Unverified
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Jun 29, 2023	Speech Synthesis, text-to-speech	Unverified
Hippocratic Abbreviation Expansion	Jun 1, 2014	Information Retrieval, Machine Translation	Unverified
HMM-based Mandarin Singing Voice Synthesis Using Tailored Synthesis Units and Question Sets	Dec 1, 2013	Singing Voice Synthesis, Speech Synthesis	Unverified
UzbekTagger: The rule-based POS tagger for Uzbek language	Jan 30, 2023	Language Modeling, Language Modelling	Unverified
VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	May 21, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Jul 4, 2024	Accented Speech Recognition, Automatic Speech Recognition	Unverified
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model	Jun 6, 2024	Language Modeling, Language Modelling	Unverified
Improving homograph disambiguation with supervised machine learning	May 1, 2018	BIG-bench Machine Learning, Speech Synthesis	Unverified
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Dec 22, 2024	Decoder, Disentanglement	Unverified
Incremental Machine Speech Chain Towards Enabling Listening while Speaking in Real-time	Nov 4, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework	Nov 7, 2019	Sentence, Speech Synthesis	Unverified
Individuality-Preserving Spectrum Modification for Articulation Disorders Using Phone Selective Synthesis	Sep 1, 2015	Speech Synthesis, Text-To-Speech Synthesis	Unverified
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Jun 8, 2024	Speech Synthesis, text-to-speech	Unverified
Investigating Inter- and Intra-speaker Voice Conversion using Audiobooks	Jun 1, 2022	Speech Synthesis, text-to-speech	Unverified
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Dec 16, 2022	Language Modeling, Language Modelling	Unverified
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis	May 20, 2020	Speech Synthesis, text-to-speech	Unverified
ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Feb 13, 2025	Adversarial Attack, Adversarial Attack Detection	Unverified
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Jun 12, 2024	Quantization, Speech Synthesis	Unverified
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention	Feb 12, 2021	Speech Synthesis, text-to-speech	Unverified
Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language	Aug 1, 2013	Machine Translation, Part-Of-Speech Tagging	Unverified
AS-Speech: Adaptive Style For Speech Synthesis	Sep 9, 2024	Rhythm, Speech Synthesis	Unverified
LDC Forced Aligner	May 1, 2012	Sentence, Speech Recognition	Unverified
Variations prosodiques en synthèse par sélection d'unités: l'exemple des phrases interrogatives (Prosodic variations in unit-based speech synthesis: the example of interrogative sentences) [in French]	Jun 1, 2012	Speech Synthesis, Text-To-Speech Synthesis	Unverified
Learning Sentiment Lexicons in Spanish	May 1, 2012	Opinion Mining, Question Answering	Unverified
Leveraging supplemental representations for sequential transduction	Jun 1, 2012	Speech Synthesis, Text-To-Speech Synthesis	Unverified
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	May 12, 2025	Speech Synthesis, text-to-speech	Unverified
A Review of Deep Learning Techniques for Speech Processing	Apr 30, 2023	Automatic Speech Recognition, Deep Learning	Unverified
Listening while Speaking: Speech Chain by Deep Learning	Jul 16, 2017	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm	Jul 6, 2021	Speech Synthesis, text-to-speech	Unverified
Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network	Sep 22, 2021	Knowledge Distillation, Language Modeling	Unverified
Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Jan 10, 2025	Speech Synthesis, text-to-speech	Unverified
M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis	May 3, 2023	Speech Synthesis, text-to-speech	Unverified
Machine Speech Chain with One-shot Speaker Adaptation	Mar 28, 2018	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French]	Jun 1, 2012	Speech Synthesis, text-to-speech	Unverified
Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis	Mar 29, 2022	Speech Synthesis, text-to-speech	Unverified
Meta Learning Text-to-Speech Synthesis in over 7000 Languages	Jun 10, 2024	Meta-Learning, Speech Synthesis	Unverified
Minimally Supervised Number Normalization	Jan 1, 2016	speech-recognition, Speech Recognition	Unverified
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Oct 27, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis	Dec 17, 2023	Speech Synthesis, Style Transfer	Unverified
Accent conversion using discrete units with parallel data synthesized from controllable accented TTS	Sep 30, 2024	Data Augmentation, Speech Synthesis	Unverified
Modular Meta-Learning with Shrinkage	Sep 12, 2019	Image Classification, Meta-Learning	Unverified
Applying Automated Machine Translation to Educational Video Courses	Jan 9, 2023	Machine Translation, Speech Synthesis	Unverified
MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting	May 19, 2023	Speech Synthesis, text-to-speech	Unverified
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning	Feb 10, 2021	Speech Synthesis, text-to-speech	Unverified
Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis	Jun 16, 2024	Disentanglement, Speech Synthesis	Unverified
Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer	Sep 3, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified

Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified