SOTAVerified

Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a text-centric cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
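The cascade described above can be sketched as three composed stages. This is a minimal illustration, not any particular system's implementation: the `CascadeS2ST` class and the `toy_asr`, `toy_mt`, and `toy_tts` functions are hypothetical stand-ins, and audio is represented as raw bytes purely to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only);
# a real system would wrap actual ASR, MT, and TTS models here.
ASR = Callable[[bytes], str]   # source speech -> source text
MT = Callable[[str], str]      # source text -> target text
TTS = Callable[[str], bytes]   # target text -> target speech


@dataclass
class CascadeS2ST:
    """Text-centric cascade: every hop passes through a text representation."""
    asr: ASR
    mt: MT
    tts: TTS

    def translate(self, source_audio: bytes) -> bytes:
        source_text = self.asr(source_audio)   # step 1: recognize
        target_text = self.mt(source_text)     # step 2: translate text
        return self.tts(target_text)           # step 3: synthesize


# Toy stand-ins so the sketch runs without any models.
def toy_asr(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 encoded text.
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    # Pretend synthesis is just encoding the text back to bytes.
    return text.encode("utf-8")


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
result = pipeline.translate(b"hello world")
print(result)
```

Because each stage only sees the previous stage's text output, recognition errors propagate downstream and non-textual cues such as prosody are lost at the first hop, which is one motivation for the direct approaches that skip the intermediate text.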

Papers

Showing 51–75 of 117 papers

| Title | Status | Hype |
|---|---|---|
| NAIST Simultaneous Speech Translation System for IWSLT 2024 | | 0 |
| Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling | | 0 |
| Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation | | 0 |
| PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese | | 0 |
| PolyVoice: Language Models for Speech to Speech Translation | | 0 |
| Portable Speech-to-Speech Translation on an Android Smartphone: The MFLTS System | | 0 |
| Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems | | 0 |
| What does it take to get state of the art in simultaneous speech-to-speech translation? | | 0 |
| A Case Study on Filtering for End-to-End Speech Translation | | 0 |
| A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation | | 0 |
| Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation | | 0 |
| Assessing Evaluation Metrics for Speech-to-Speech Translation | | 0 |
| AudioPaLM: A Large Language Model That Can Speak and Listen | | 0 |
| A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation | | 0 |
| Automatic Extraction of Parallel Speech Corpora from Dubbed Movies | | 0 |
| AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation | | 0 |
| Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | | 0 |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | | 0 |
| Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus | | 0 |
| Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis | | 0 |
| CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning | | 0 |
| DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation | | 0 |
| Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation | | 0 |
| Direct Punjabi to English speech translation using discrete units | | 0 |
| Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Hokkien→En (Two-pass decoding) | ASR-BLEU (Dev) | 13.6 | | Unverified |
| 2 | Hokkien→En (Two-stage) | ASR-BLEU (Dev) | 12.5 | | Unverified |
| 3 | Hokkien→En (Three-stage) | ASR-BLEU (Dev) | 12.5 | | Unverified |
| 4 | Hokkien→En (Single-pass decoding) | ASR-BLEU (Dev) | 8.8 | | Unverified |
| 5 | En→Hokkien (Two-pass decoding) | ASR-BLEU (Dev) | 7.8 | | Unverified |
| 6 | En→Hokkien (Three-stage) | ASR-BLEU (Dev) | 7.5 | | Unverified |
| 7 | En→Hokkien (Two-stage) | ASR-BLEU (Dev) | 7.1 | | Unverified |
| 8 | En→Hokkien (Single-pass decoding) | ASR-BLEU (Dev) | 6.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GenTranslateV2 | ASR-BLEU | 32.3 | | Unverified |
| 2 | GenTranslateV1 | ASR-BLEU | 30.1 | | Unverified |
| 3 | SeamlessM4T LargeV2 | ASR-BLEU | 29.4 | | Unverified |
| 4 | SeamlessM4T Large | ASR-BLEU | 25.8 | | Unverified |
| 5 | AudioPaLM2 | ASR-BLEU | 24 | | Unverified |
| 6 | WhisperV2 | ASR-BLEU | 23.5 | | Unverified |
| 7 | SeamlessM4T Medium | ASR-BLEU | 20.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SeamlessM4T Large | ASR-BLEU | 36.5 | | Unverified |
| 2 | SeamlessM4T Medium | ASR-BLEU | 28.1 | | Unverified |