SOTAVerified

Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, an approach that is text-centric. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
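
The cascade described above can be sketched as three functions chained together. This is only a minimal illustration: the class name `CascadeS2ST`, the `Audio` alias, and the toy stub components are all hypothetical placeholders, not the API of any real toolkit or of any paper listed here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a waveform; real systems would use arrays or tensors.
Audio = List[float]


@dataclass
class CascadeS2ST:
    """Text-centric cascade: ASR -> MT -> TTS (all components are assumptions)."""
    asr: Callable[[Audio], str]   # source speech -> source-language text
    mt: Callable[[str], str]      # source-language text -> target-language text
    tts: Callable[[str], Audio]   # target-language text -> target speech

    def translate(self, speech: Audio) -> Audio:
        source_text = self.asr(speech)      # step 1: recognize
        target_text = self.mt(source_text)  # step 2: translate the text
        return self.tts(target_text)        # step 3: synthesize


# Toy stubs so the sketch runs end to end; trained models would replace them.
def toy_asr(speech: Audio) -> str:
    return "hello" if speech else ""


def toy_mt(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Encode each character as a float, purely as a placeholder "waveform".
    return [float(ord(ch)) for ch in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
output = pipeline.translate([0.1, 0.2])
```

Because every stage passes text to the next, an error made by the ASR step is carried into MT and TTS. A direct S2ST system would replace the three callables with a single speech-to-speech model.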

Papers

Showing 51–100 of 117 papers

Title | Status | Hype
SimulTron: On-Device Simultaneous Speech to Speech Translation |  | 0
Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing |  | 0
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought |  | 0
CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning |  | 0
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation | Code | 0
MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation |  | 0
Direct Punjabi to English speech translation using discrete units |  | 0
A Case Study on Filtering for End-to-End Speech Translation |  | 0
TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data |  | 0
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation |  | 0
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation |  | 0
Enhancing expressivity transfer in textless speech-to-speech translation |  | 0
Direct Text to Speech Translation System using Acoustic Units |  | 0
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer |  | 0
Multilingual Speech-to-Speech Translation into Multiple Target Languages |  | 0
Towards cross-language prosody transfer for dialog | Code | 0
AudioPaLM: A Large Language Model That Can Speak and Listen |  | 0
PolyVoice: Language Models for Speech to Speech Translation |  | 0
Translatotron 3: Speech to Speech Translation with Monolingual Data |  | 0
Textless Speech-to-Speech Translation With Limited Parallel Data | Code | 0
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation |  | 0
i-Code Studio: A Configurable and Composable Framework for Integrative AI |  | 0
Duplex Diffusion Models Improve Speech-to-Speech Translation |  | 0
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit | Code | 0
Enhancing Speech-to-Speech Translation with Multiple TTS Targets |  | 0
A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation |  | 0
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units | Code | 0
Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features |  | 0
Dialogs Re-enacted Across Languages | Code | 0
Speech-to-Speech Translation For A Real-world Unwritten Language |  | 0
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations |  | 0
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation | Code | 0
Textless Direct Speech-to-Speech Translation with Discrete Speech Representation |  | 0
Improving Speech-to-Speech Translation Through Unlabeled Text |  | 0
A Textless Metric for Speech-to-Speech Comparison | Code | 0
Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling |  | 0
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation |  | 0
Findings of the IWSLT 2022 Evaluation Campaign |  | 0
Pretrained Speech Encoders and Efficient Fine-tuning Methods for Speech Translation: UPC at IWSLT 2022 | Code | 0
The HW-TSC’s Speech to Speech Translation System for IWSLT 2022 Evaluation |  | 0
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks |  | 0
LibriS2S: A German-English Speech-to-Speech Translation Corpus | Code | 0
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation |  | 0
Prosodic Alignment for off-screen automatic dubbing |  | 0
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation |  | 0
Evaluating MT Systems: A Theoretical Framework |  | 0
Textless Speech-to-Speech Translation on Real Data |  | 0
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining | Code | 0
Assessing Evaluation Metrics for Speech-to-Speech Translation |  | 0
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Hokkien→En (Two-pass decoding) | ASR-BLEU (Dev) | 13.6 |  | Unverified
2 | Hokkien→En (Three-stage) | ASR-BLEU (Dev) | 12.5 |  | Unverified
3 | Hokkien→En (Two-stage) | ASR-BLEU (Dev) | 12.5 |  | Unverified
4 | Hokkien→En (Single-pass decoding) | ASR-BLEU (Dev) | 8.8 |  | Unverified
5 | En→Hokkien (Two-pass decoding) | ASR-BLEU (Dev) | 7.8 |  | Unverified
6 | En→Hokkien (Three-stage) | ASR-BLEU (Dev) | 7.5 |  | Unverified
7 | En→Hokkien (Two-stage) | ASR-BLEU (Dev) | 7.1 |  | Unverified
8 | En→Hokkien (Single-pass decoding) | ASR-BLEU (Dev) | 6.6 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GenTranslateV2 | ASR-BLEU | 32.3 |  | Unverified
2 | GenTranslateV1 | ASR-BLEU | 30.1 |  | Unverified
3 | SeamlessM4T LargeV2 | ASR-BLEU | 29.4 |  | Unverified
4 | SeamlessM4T Large | ASR-BLEU | 25.8 |  | Unverified
5 | AudioPaLM2 | ASR-BLEU | 24 |  | Unverified
6 | WhisperV2 | ASR-BLEU | 23.5 |  | Unverified
7 | SeamlessM4T Medium | ASR-BLEU | 20.4 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SeamlessM4T Large | ASR-BLEU | 36.5 |  | Unverified
2 | SeamlessM4T Medium | ASR-BLEU | 28.1 |  | Unverified