Text-To-Speech Synthesis

Text-To-Speech Synthesis is the machine learning task of converting written text into spoken audio. The goal is to generate synthetic speech that sounds natural and is as close to human speech as possible.

Papers

Showing 1–50 of 332 papers

Title	Date	Tasks	Status	Hype
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching	Jun 16, 2025	Decoder, Speech Synthesis	Code Available	4
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading Comprehension, Speech Synthesis	Unverified	0
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Jun 4, 2025	Data Augmentation, Diversity	Unverified	0
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech	Jun 3, 2025	Speech Synthesis, text-to-speech	Unverified	0
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction	Jun 2, 2025	Speech Synthesis, text-to-speech	Unverified	0
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems	May 31, 2025	Language Modeling, Language Modelling	Unverified	0
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	May 26, 2025	Sentence, Speech Synthesis	Unverified	0
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis	May 25, 2025	Speech Synthesis, text-to-speech	Unverified	0
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	May 21, 2025	Bayesian Optimization, Speech Synthesis	Code Available	1
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	May 20, 2025	Dataset Generation, Speech Synthesis	Unverified	0
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis	May 18, 2025	Speech Synthesis, text-to-speech	Unverified	0
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	May 12, 2025	Speech Synthesis, text-to-speech	Unverified	0
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Apr 22, 2025	cross-modal alignment, Script Generation	Unverified	0
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis	Apr 14, 2025	RAG, Retrieval-augmented Generation	Unverified	0
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Apr 14, 2025	Language Modeling, Language Modelling	Unverified	0
MoonCast: High-Quality Zero-Shot Podcast Generation	Mar 18, 2025	Speech Synthesis, text-to-speech	Code Available	3
ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Feb 13, 2025	Adversarial Attack, Adversarial Attack Detection	Unverified	0
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control	Jan 10, 2025	Speech Synthesis, text-to-speech	Unverified	0
Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Jan 10, 2025	Speech Synthesis, text-to-speech	Unverified	0
Probing Speaker-specific Features in Speaker Representations	Jan 9, 2025	Self-Supervised Learning, Speaker Verification	Unverified	0
Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Dec 28, 2024	Speech Synthesis, text-to-speech	Unverified	0
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Dec 22, 2024	Decoder, Disentanglement	Unverified	0
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Dec 16, 2024	Speech Synthesis, text-to-speech	Unverified	0
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens	Dec 13, 2024	Conditional Image Generation, Image Generation	Unverified	0
Multimodal Latent Language Modeling with Next-Token Diffusion	Dec 11, 2024	Image Generation, Language Modeling	Code Available	0
Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Nov 10, 2024	Speech Synthesis, text-to-speech	Unverified	0
Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Oct 30, 2024	Speech Synthesis, text-to-speech	Code Available	2
A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages	Oct 18, 2024	Speech Synthesis, text-to-speech	Unverified	0
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Oct 17, 2024	Speech Synthesis, text-to-speech	Unverified	0
Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Oct 9, 2024	Speech Synthesis, text-to-speech	Unverified	0
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Oct 9, 2024	Diversity, Speech Synthesis	Unverified	0
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Oct 6, 2024	Language Modeling, Language Modelling	Unverified	0
Generative Semantic Communication for Text-to-Speech Synthesis	Oct 4, 2024	Quantization, Semantic Communication	Unverified	0
Accent conversion using discrete units with parallel data synthesized from controllable accented TTS	Sep 30, 2024	Data Augmentation, Speech Synthesis	Unverified	0
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Sep 24, 2024	Speech Synthesis, text-to-speech	Unverified	0
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Sep 16, 2024	Speech Synthesis, text-to-speech	Unverified	0
Text-To-Speech Synthesis In The Wild	Sep 13, 2024	Benchmarking, Speaker Recognition	Unverified	0
Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Sep 12, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	Sep 11, 2024	Decoder, Speech Synthesis	Code Available	2
What happens to diffusion model likelihood when your model is conditional?	Sep 10, 2024	domain classification, model	Unverified	0
AS-Speech: Adaptive Style For Speech Synthesis	Sep 9, 2024	Rhythm, Speech Synthesis	Unverified	0
Sample-Efficient Diffusion for Text-To-Speech Synthesis	Sep 1, 2024	Language Modeling, Language Modelling	Code Available	2
Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Jul 26, 2024	Generative Adversarial Network, Speech Enhancement	Unverified	0
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Jul 18, 2024	Language Modeling, Language Modelling	Unverified	0
Autoregressive Speech Synthesis without Vector Quantization	Jul 11, 2024	Audio Compression, Diversity	Unverified	0
Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis	Jul 4, 2024	Accented Speech Recognition, Automatic Speech Recognition	Unverified	0
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Jul 2, 2024	Inference Optimization, Speech Synthesis	Unverified	0
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis	Jun 30, 2024	CPU, Decoder	Unverified	0
Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis	Jun 16, 2024	Disentanglement, Speech Synthesis	Unverified	0
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	Jun 12, 2024	Quantization, Speech Synthesis	Unverified	0

Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	NaturalSpeech	Audio Quality MOS	4.56	—	Unverified
2	VITS	Audio Quality MOS	4.43	—	Unverified
3	Grad-TTS + HiFiGAN (1000 steps)	Audio Quality MOS	4.37	—	Unverified
4	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
5	Glow-TTS + HiFiGAN	Audio Quality MOS	4.34	—	Unverified
6	FastSpeech 2 + HiFiGAN	Audio Quality MOS	4.32	—	Unverified
7	FastDiff (4 steps)	Audio Quality MOS	4.28	—	Unverified
8	FastDiff-TTS	Audio Quality MOS	4.03	—	Unverified
9	Transformer TTS (Mel + WaveGlow)	Audio Quality MOS	3.88	—	Unverified
10	FastSpeech (Mel + WaveGlow)	Audio Quality MOS	3.84	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Mia	10-keyword Speech Commands dataset	16	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Token-Level Ensemble Distillation	Phoneme Error Rate	4.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tacotron 2	Mean Opinion Score	3.49	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Match-TTSG	MOS	3.7	—	Unverified
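
Most of the tables above report a Mean Opinion Score (MOS), which is the arithmetic mean of listener ratings for a set of audio clips. The following is a minimal sketch of that calculation, assuming the common convention of ratings on a 1 to 5 scale and a normal-approximation 95% confidence interval. The function name and the ratings are hypothetical and are not taken from any paper or benchmark listed here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for listener ratings on a 1-5 scale.

    half_width is the half-width of a normal-approximation 95%
    confidence interval around the mean.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    half_width = 1.96 * std_err
    return mos, half_width


# Hypothetical ratings from eight listeners, for illustration only.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Published MOS figures depend on the listener pool, the test clips, and the rating protocol, so scores from different papers or datasets are not directly comparable.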