Text-To-Speech Synthesis

Text-to-speech (TTS) synthesis is a machine learning task that converts written text into spoken audio. The goal is to generate synthetic speech that sounds as natural and as close to human speech as possible.

Papers

Showing 151–200 of 332 papers

Title	Date	Tasks	Status
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs	Jul 18, 2023	Generative Adversarial Network, Language Modeling	Unverified
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units	Jun 29, 2023	Speech Synthesis, text-to-speech	Unverified
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale	Jun 23, 2023	In-Context Learning, Speech Synthesis	Code Available
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models	May 23, 2023	Speech Synthesis, text-to-speech	Unverified
VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages	May 21, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting	May 19, 2023	Speech Synthesis, text-to-speech	Unverified
A unified front-end framework for English text-to-speech synthesis	May 18, 2023	Speech Synthesis, Text Normalization	Unverified
Accented Text-to-Speech Synthesis with Limited Data	May 8, 2023	Speech Synthesis, text-to-speech	Unverified
M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis	May 3, 2023	Speech Synthesis, text-to-speech	Unverified
A Review of Deep Learning Techniques for Speech Processing	Apr 30, 2023	Automatic Speech Recognition, Deep Learning	Unverified
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model	Apr 24, 2023	Rhythm, Self-Supervised Learning	Unverified
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis	Mar 27, 2023	All, Automatic Speech Recognition	Unverified
A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI	Mar 23, 2023	Speech Enhancement, Speech Synthesis	Unverified
Controllable Prosody Generation With Partial Inputs	Mar 14, 2023	Speech Synthesis, text-to-speech	Unverified
Do Prosody Transfer Models Transfer Prosody?	Mar 7, 2023	Speech Synthesis, text-to-speech	Unverified
ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations	Mar 1, 2023	Self-Supervised Learning, Speech Synthesis	Unverified
UzbekTagger: The rule-based POS tagger for Uzbek language	Jan 30, 2023	Language Modeling, Language Modelling	Unverified
Applying Automated Machine Translation to Educational Video Courses	Jan 9, 2023	Machine Translation, Speech Synthesis	Unverified
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration	Jan 1, 2023	Audio-Visual Speech Recognition, Resynthesis	Unverified
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement	Dec 21, 2022	Audio-Visual Speech Recognition, Resynthesis	Unverified
Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder	Dec 16, 2022	Representation Learning, Speech Synthesis	Unverified
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language	Dec 16, 2022	Language Modeling, Language Modelling	Unverified
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models	Nov 17, 2022	Speech Synthesis, text-to-speech	Unverified
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages	Nov 1, 2022	Chunking, Rhythm	Unverified
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech	Oct 27, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era	Oct 6, 2022	Speech Synthesis, text-to-speech	Unverified
Controllable Accented Text-to-Speech Synthesis	Sep 22, 2022	Speech Synthesis, text-to-speech	Unverified
EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models	Sep 22, 2022	Speech Synthesis, text-to-speech	Unverified
Mlphon: A Multifunctional Grapheme-Phoneme Conversion Tool Using Finite State Transducers	Sep 5, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available
BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model	Jul 4, 2022	Language Modeling, Language Modelling	Unverified
R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS	Jun 30, 2022	Decoder, GPU	Unverified
Exploring Transfer Learning for Urdu Speech Synthesis	Jun 1, 2022	Speech Synthesis, text-to-speech	Unverified
BU-TTS: An Open-Source, Bilingual Welsh-English, Text-to-Speech Corpus	Jun 1, 2022	Speech Synthesis, text-to-speech	Unverified
Investigating Inter- and Intra-speaker Voice Conversion using Audiobooks	Jun 1, 2022	Speech Synthesis, text-to-speech	Unverified
Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish	May 31, 2022	Machine Translation, Speech Synthesis	Code Available
ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence	May 9, 2022	Speech Synthesis, text-to-speech	Unverified
Systematic Inequalities in Language Technology Performance across the World’s Languages	May 1, 2022	Dependency Parsing, Machine Translation	Code Available
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance	Apr 11, 2022	Speaker Verification, Speech Synthesis	Unverified
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis	Apr 6, 2022	Speech Synthesis, text-to-speech	Unverified
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature	Apr 2, 2022	Speech Synthesis, text-to-speech	Unverified
Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis	Mar 29, 2022	Speech Synthesis, text-to-speech	Unverified
AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling	Mar 21, 2022	Decoder, Speech Synthesis	Unverified
ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis	Mar 20, 2022	Speaker Verification, Speech Synthesis	Code Available
Text-free non-parallel many-to-many voice conversion using normalising flows	Mar 15, 2022	Normalising Flows, Speech Synthesis	Unverified
Deep Performer: Score-to-Audio Music Performance Synthesis	Feb 12, 2022	Decoder, Speech Synthesis	Unverified
Multi-Stage Deep Transfer Learning for EmIoT-enabled Human-Computer Interaction	Feb 3, 2022	Human-Object Interaction Detection, text-to-speech	Unverified
Transformer-based Models of Text Normalization for Speech Applications	Feb 1, 2022	Sentence, Speech Synthesis	Unverified
Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios	Dec 23, 2021	Diversity, Speech Synthesis	Unverified
Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance	Nov 23, 2021	speech-recognition, Speech Recognition	Unverified
Systematic Inequalities in Language Technology Performance across the World's Languages	Oct 13, 2021	Dependency Parsing, Machine Translation	Code Available

Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified