Speech-to-Speech Translation
Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, an approach that is text-centric. More recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
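The cascaded approach described above can be sketched as three composed stages. This is a minimal illustration only: the `CascadeS2ST` class, the `Audio` type alias, and the `toy_*` stand-in functions are hypothetical names introduced here, not part of any real toolkit, and a real system would plug trained ASR, MT, and TTS models into the same three slots.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder type: a waveform as a plain list of samples.
Audio = List[float]


@dataclass
class CascadeS2ST:
    """Text-centric cascade: source speech -> source text -> target text -> target speech."""
    asr: Callable[[Audio], str]   # automatic speech recognition
    mt: Callable[[str], str]      # text-to-text machine translation
    tts: Callable[[str], Audio]   # text-to-speech synthesis

    def translate(self, source_audio: Audio) -> Audio:
        source_text = self.asr(source_audio)   # stage 1: transcribe
        target_text = self.mt(source_text)     # stage 2: translate the transcript
        return self.tts(target_text)           # stage 3: synthesize target speech


# Toy stand-ins (assumptions for illustration, not real models).
def toy_asr(audio: Audio) -> str:
    return "hello"


def toy_mt(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Emit one fake "sample" per character of the target text.
    return [float(ord(ch)) for ch in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
output_audio = pipeline.translate([0.0, 0.1, 0.2])
```

Because each stage only sees the previous stage's output, recognition or translation errors propagate downstream, and anything not captured in text is lost at the intermediate step. A direct S2ST system would replace the three callables with a single speech-to-speech model.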
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hokkien→En (Two-pass decoding) | ASR-BLEU (Dev) | 13.6 | — | Unverified |
| 2 | Hokkien→En (Two-stage) | ASR-BLEU (Dev) | 12.5 | — | Unverified |
| 3 | Hokkien→En (Three-stage) | ASR-BLEU (Dev) | 12.5 | — | Unverified |
| 4 | Hokkien→En (Single-pass decoding) | ASR-BLEU (Dev) | 8.8 | — | Unverified |
| 5 | En→Hokkien (Two-pass decoding) | ASR-BLEU (Dev) | 7.8 | — | Unverified |
| 6 | En→Hokkien (Three-stage) | ASR-BLEU (Dev) | 7.5 | — | Unverified |
| 7 | En→Hokkien (Two-stage) | ASR-BLEU (Dev) | 7.1 | — | Unverified |
| 8 | En→Hokkien (Single-pass decoding) | ASR-BLEU (Dev) | 6.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GenTranslateV2 | ASR-BLEU | 32.3 | — | Unverified |
| 2 | GenTranslateV1 | ASR-BLEU | 30.1 | — | Unverified |
| 3 | SeamlessM4T LargeV2 | ASR-BLEU | 29.4 | — | Unverified |
| 4 | SeamlessM4T Large | ASR-BLEU | 25.8 | — | Unverified |
| 5 | AudioPaLM2 | ASR-BLEU | 24 | — | Unverified |
| 6 | WhisperV2 | ASR-BLEU | 23.5 | — | Unverified |
| 7 | SeamlessM4T Medium | ASR-BLEU | 20.4 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SeamlessM4T Large | ASR-BLEU | 36.5 | — | Unverified |
| 2 | SeamlessM4T Medium | ASR-BLEU | 28.1 | — | Unverified |