SOTAVerified

Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that converts written text into spoken audio. The goal is to generate synthetic speech that sounds natural and is as close to human speech as possible.

Papers

Showing 201-250 of 332 papers

Title | Status | Hype
Probing Speaker-specific Features in Speaker Representations | | 0
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | | 0
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | | 0
PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders | | 0
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | | 0
Prosody-TTS: An end-to-end speech synthesis system with prosody control | | 0
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | | 0
Punjabi Text-To-Speech Synthesis System | | 0
Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder | | 0
Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks | | 0
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | | 0
Real-time Incremental Speech-to-Speech Translation of Dialogs | | 0
ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence | | 0
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images | | 0
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability | | 0
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models | | 0
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement | | 0
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration | | 0
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis | | 0
R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS | | 0
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization | | 0
RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus | | 0
Russian Stress Prediction using Maximum Entropy Ranking | | 0
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | | 0
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model | | 0
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation | | 0
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | | 0
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis | | 0
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input | | 0
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis | | 0
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation | | 0
Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain | | 0
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models | | 0
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | | 0
Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS | | 0
SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs | | 0
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis | | 0
Speaker-independent raw waveform model for glottal excitation | | 0
Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis | | 0
Aligning Opinions: Cross-Lingual Opinion Mining with Dependencies | | 0
Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention | | 0
Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | | 0
Speech denoising by parametric resynthesis | | 0
WebWOZ: A Platform for Designing and Conducting Web-based Wizard of Oz Experiments | | 0
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | | 0
A distributed cloud-based dialog system for conversational application development | | 0
Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | | 0
Adaptive Parser-Centric Text Normalization | | 0
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | | 0
Style Mixture of Experts for Expressive Text-To-Speech Synthesis | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | NaturalSpeech | Audio Quality MOS | 4.56 | | Unverified
2 | VITS | Audio Quality MOS | 4.43 | | Unverified
3 | Grad-TTS + HiFiGAN (1000 steps) | Audio Quality MOS | 4.37 | | Unverified
4 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified
5 | Glow-TTS + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified
6 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.32 | | Unverified
7 | FastDiff (4 steps) | Audio Quality MOS | 4.28 | | Unverified
8 | FastDiff-TTS | Audio Quality MOS | 4.03 | | Unverified
9 | Transformer TTS (Mel + WaveGlow) | Audio Quality MOS | 3.88 | | Unverified
10 | FastSpeech (Mel + WaveGlow) | Audio Quality MOS | 3.84 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Mia | 10-keyword Speech Commands dataset | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Token-Level Ensemble Distillation | Phoneme Error Rate | 4.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Tacotron 2 | Mean Opinion Score | 3.74 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Tacotron 2 | Mean Opinion Score | 3.49 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Match-TTSG | MOS | 3.7 | | Unverified