Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 101–150 of 332 papers

Title	Date	Tasks	Status
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Oct 17, 2024	Speech Synthesistext-to-speech	—Unverified
Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Oct 9, 2024	Speech Synthesistext-to-speech	—Unverified
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Oct 9, 2024	DiversitySpeech Synthesis	—Unverified
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Oct 6, 2024	Language ModelingLanguage Modelling	—Unverified
Generative Semantic Communication for Text-to-Speech Synthesis	Oct 4, 2024	QuantizationSemantic Communication	—Unverified
Accent conversion using discrete units with parallel data synthesized from controllable accented TTS	Sep 30, 2024	Data AugmentationSpeech Synthesis	—Unverified
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Sep 24, 2024	Speech Synthesistext-to-speech	—Unverified
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Sep 16, 2024	Speech Synthesistext-to-speech	—Unverified
Text-To-Speech Synthesis In The Wild	Sep 13, 2024	BenchmarkingSpeaker Recognition	—Unverified
Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Sep 12, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified
What happens to diffusion model likelihood when your model is conditional?	Sep 10, 2024	domain classificationmodel	—Unverified
AS-Speech: Adaptive Style For Speech Synthesis	Sep 9, 2024	RhythmSpeech Synthesis	—Unverified
Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Jul 26, 2024	Generative Adversarial NetworkSpeech Enhancement	—Unverified
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Jul 18, 2024	Language ModelingLanguage Modelling	—Unverified
Autoregressive Speech Synthesis without Vector Quantization	Jul 11, 2024	Audio CompressionDiversity	—Unverified
Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Jul 4, 2024	Accented Speech RecognitionAutomatic Speech Recognition	—Unverified
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Jul 2, 2024	Inference OptimizationSpeech Synthesis	—Unverified
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis	Jun 30, 2024	CPUDecoder	—Unverified
Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis	Jun 16, 2024	DisentanglementSpeech Synthesis	—Unverified
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Jun 12, 2024	QuantizationSpeech Synthesis	—Unverified
Meta Learning Text-to-Speech Synthesis in over 7000 Languages	Jun 10, 2024	Meta-LearningSpeech Synthesis	—Unverified
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis	Jun 8, 2024	Audio GenerationDecoder	—Unverified
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	Jun 8, 2024	Speech Synthesistext-to-speech	—Unverified
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model	Jun 6, 2024	Language ModelingLanguage Modelling	—Unverified
Style Mixture of Experts for Expressive Text-To-Speech Synthesis	Jun 5, 2024	Mixture-of-ExpertsSpeech Synthesis	—Unverified
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis	Jun 4, 2024	In-Context LearningLanguage Modeling	—Unverified
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback	Jun 2, 2024	Speech Synthesistext-to-speech	—Unverified
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models	May 23, 2024	Image Generationreinforcement-learning	—Unverified
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model	May 16, 2024	HallucinationLanguage Modeling	—Unverified
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	Apr 4, 2024	Language ModelingLanguage Modelling	—Unverified
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders	Apr 3, 2024	Representation LearningSpeaker Verification	—Unverified
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting	Feb 19, 2024	Language ModelingLanguage Modelling	CodeCode Available
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters	Jan 10, 2024	Self-Supervised LearningSpeech Enhancement	—Unverified
Boosting Large Language Model for Speech Synthesis: An Empirical Study	Dec 30, 2023	Language ModelingLanguage Modelling	—Unverified
Normalization of Lithuanian Text Using Regular Expressions	Dec 29, 2023	Speech SynthesisText Normalization	—Unverified
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis	Dec 17, 2023	Speech SynthesisStyle Transfer	—Unverified
An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis	Dec 8, 2023	BenchmarkingQuantization	—Unverified
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis	Dec 6, 2023	Speech Synthesistext-to-speech	—Unverified
Code-Mixed Text to Speech Synthesis under Low-Resource Constraints	Dec 2, 2023	Speech Synthesistext-to-speech	—Unverified
Guided Flows for Generative Modeling and Decision Making	Nov 22, 2023	Conditional Image GenerationDecision Making	—Unverified
Generative Pre-training for Speech with Flow Matching	Oct 25, 2023	Speech EnhancementSpeech Synthesis	—Unverified
Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors	Oct 25, 2023	en-US domain classificationen-US Intent Classification	CodeCode Available
Attentive Multi-Layer Perceptron for Non-autoregressive Generation	Oct 14, 2023	Machine TranslationSpeech Synthesis	CodeCode Available
Unified speech and gesture synthesis using flow matching	Oct 8, 2023	Audio SynthesisMotion Synthesis	—Unverified
The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains	Oct 4, 2023	Speech Synthesistext-to-speech	—Unverified
DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis	Sep 22, 2023	DenoisingSpeech Synthesis	—Unverified
The FruitShell French synthesis system at the Blizzard 2023 Challenge	Sep 1, 2023	Data AugmentationSpeech Synthesis	—Unverified
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis	Aug 31, 2023	Expressive Speech SynthesisSentence	—Unverified
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis	Aug 2, 2023	DecoderSelf-Supervised Learning	—Unverified
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech	Jul 31, 2023	Acoustic ModellingSpeech Synthesis	—Unverified

Show:10 25 50

← PrevPage 3 of 7Next →

All datasets LJSpeech 20000 utterances CMUDict 0.7b HUI speech corpus Thorsten voice 21.02 neutral Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified