Text-To-Speech Synthesis
Text-to-speech (TTS) synthesis is the machine learning task of converting written text into spoken audio. The goal is to generate synthetic speech that sounds as natural and as close to human speech as possible.
Datasets: LJSpeech, 20000 utterances, CMUDict 0.7b, HUI speech corpus, Thorsten voice 21.02 neutral, Trinity Speech-Gesture Dataset
Benchmark Results

LJSpeech

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | NaturalSpeech | Audio Quality MOS | 4.56 | — | Unverified |
| 2 | VITS | Audio Quality MOS | 4.43 | — | Unverified |
| 3 | Grad-TTS + HiFiGAN (1000 steps) | Audio Quality MOS | 4.37 | — | Unverified |
| 4 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.34 | — | Unverified |
| 5 | Glow-TTS + HiFiGAN | Audio Quality MOS | 4.34 | — | Unverified |
| 6 | FastSpeech 2 + HiFiGAN | Audio Quality MOS | 4.32 | — | Unverified |
| 7 | FastDiff (4 steps) | Audio Quality MOS | 4.28 | — | Unverified |
| 8 | FastDiff-TTS | Audio Quality MOS | 4.03 | — | Unverified |
| 9 | Transformer TTS (Mel + WaveGlow) | Audio Quality MOS | 3.88 | — | Unverified |
| 10 | FastSpeech (Mel + WaveGlow) | Audio Quality MOS | 3.84 | — | Unverified |

20000 utterances

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mia | 10-keyword Speech Commands dataset | 16 | — | Unverified |

CMUDict 0.7b

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Token-Level Ensemble Distillation | Phoneme Error Rate | 4.6 | — | Unverified |

HUI speech corpus

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tacotron 2 | Mean Opinion Score | 3.74 | — | Unverified |

Thorsten voice 21.02 neutral

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tacotron 2 | Mean Opinion Score | 3.49 | — | Unverified |

Trinity Speech-Gesture Dataset

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Match-TTSG | MOS | 3.7 | — | Unverified |
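
Most results above are reported as a Mean Opinion Score (MOS), which is usually obtained by asking listeners to rate samples on a 1 to 5 scale and averaging the ratings. The sketch below shows that calculation with an approximate 95% confidence interval. The function name `mean_opinion_score`, the normal approximation (`z=1.96`), and the ratings are illustrative assumptions, not data or code from any paper listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of an approximate confidence interval
    under a normal approximation (z=1.96 corresponds to roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one system, invented for this example.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # prints "MOS = 4.25 +/- 0.49"
```

Because the score depends on the listener pool and the test setup, MOS values from different studies are generally not directly comparable, which is worth keeping in mind when reading the unverified claims above.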