SOTAVerified

Text-To-Speech Synthesis

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Papers

Showing 251–275 of 332 papers

| Title | Status | Hype |
| --- | --- | --- |
| Russian Stress Prediction using Maximum Entropy Ranking | | 0 |
| Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | | 0 |
| Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model | | 0 |
| S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation | | 0 |
| SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | | 0 |
| SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis | | 0 |
| Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input | | 0 |
| Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis | | 0 |
| Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation | | 0 |
| Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain | | 0 |
| ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models | | 0 |
| Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | | 0 |
| Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS | | 0 |
| SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs | | 0 |
| SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis | | 0 |
| Speaker-independent raw waveform model for glottal excitation | | 0 |
| Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis | | 0 |
| Aligning Opinions: Cross-Lingual Opinion Mining with Dependencies | | 0 |
| Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention | | 0 |
| Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | | 0 |
| Speech denoising by parametric resynthesis | | 0 |
| WebWOZ: A Platform for Designing and Conducting Web-based Wizard of Oz Experiments | | 0 |
| Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | | 0 |
| A distributed cloud-based dialog system for conversational application development | | 0 |
| Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | | 0 |

Benchmark Results

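| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | NaturalSpeech | Audio Quality MOS | 4.56 | | Unverified |
| 2 | VITS | Audio Quality MOS | 4.43 | | Unverified |
| 3 | Grad-TTS + HiFiGAN (1000 steps) | Audio Quality MOS | 4.37 | | Unverified |
| 4 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified |
| 5 | Glow-TTS + HiFiGAN | Audio Quality MOS | 4.34 | | Unverified |
| 6 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.32 | | Unverified |
| 7 | FastDiff (4 steps) | Audio Quality MOS | 4.28 | | Unverified |
| 8 | FastDiff-TTS | Audio Quality MOS | 4.03 | | Unverified |
| 9 | Transformer TTS (Mel + WaveGlow) | Audio Quality MOS | 3.88 | | Unverified |
| 10 | FastSpeech (Mel + WaveGlow) | Audio Quality MOS | 3.84 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Mia | 10-keyword Speech Commands dataset | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Token-Level Ensemble Distillation | Phoneme Error Rate | 4.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Tacotron 2 | Mean Opinion Score | 3.74 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Tacotron 2 | Mean Opinion Score | 3.49 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Match-TTSG | MOS | 3.7 | | Unverified |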