Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–25 of 332 papers

Title	Date	Tasks	Status	Hype
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Jun 16, 2025	DecoderSpeech Synthesis	CodeCode Available	4
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading ComprehensionSpeech Synthesis	—Unverified	0
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Jun 4, 2025	Data AugmentationDiversity	—Unverified	0
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech	Jun 3, 2025	Speech Synthesistext-to-speech	—Unverified	0
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction	Jun 2, 2025	Speech Synthesistext-to-speech	—Unverified	0
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems	May 31, 2025	Language ModelingLanguage Modelling	—Unverified	0
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	May 26, 2025	SentenceSpeech Synthesis	—Unverified	0
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	May 25, 2025	Speech Synthesistext-to-speech	—Unverified	0
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	May 21, 2025	Bayesian OptimizationSpeech Synthesis	CodeCode Available	1
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	May 20, 2025	Dataset GenerationSpeech Synthesis	—Unverified	0
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis	May 18, 2025	Speech Synthesistext-to-speech	—Unverified	0
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	May 12, 2025	Speech Synthesistext-to-speech	—Unverified	0
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Apr 22, 2025	cross-modal alignmentScript Generation	—Unverified	0
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis	Apr 14, 2025	RAGRetrieval-augmented Generation	—Unverified	0
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language ModelingLanguage Modelling	—Unverified	0
MoonCast: High-Quality Zero-Shot Podcast Generation	Mar 18, 2025	Speech Synthesistext-to-speech	CodeCode Available	3
ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Feb 13, 2025	Adversarial AttackAdversarial Attack Detection	—Unverified	0
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Jan 10, 2025	Speech Synthesistext-to-speech	—Unverified	0
Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Jan 10, 2025	Speech Synthesistext-to-speech	—Unverified	0
Probing Speaker-specific Features in Speaker Representations	Jan 9, 2025	Self-Supervised LearningSpeaker Verification	—Unverified	0
Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Dec 28, 2024	Speech Synthesistext-to-speech	—Unverified	0
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Dec 22, 2024	DecoderDisentanglement	—Unverified	0
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesistext-to-speech	—Unverified	0
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens	Dec 13, 2024	Conditional Image GenerationImage Generation	—Unverified	0
Multimodal Latent Language Modeling with Next-Token Diffusion	Dec 11, 2024	Image GenerationLanguage Modeling	CodeCode Available	0

Show:10 25 50

← PrevPage 1 of 14Next →

All datasets LJSpeech 20000 utterances CMUDict 0.7b HUI speech corpus Thorsten voice 21.02 neutral Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified