Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists on translating speech from one language to speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. Recently, works on S2ST without relying on intermediate text representation is emerging.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 117 papers

Title	Date	Tasks	Status	Hype	Score
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs	Jul 4, 2024	Emotion RecognitionEvent Detection	CodeCode Available	11	5
Robust Speech Recognition via Large-Scale Weak Supervision	Dec 6, 2022	Robust Speech Recognitionspeech-recognition	CodeCode Available	8	5
AudioLM: a Language Modeling Approach to Audio Generation	Sep 7, 2022	Audio Generation	CodeCode Available	7	5
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Jun 5, 2024	Automatic Speech Recognition (ASR)de-en	CodeCode Available	5	5
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Mar 7, 2023	In-Context LearningLanguage Modeling	CodeCode Available	5	5
High-Fidelity Simultaneous Speech-To-Speech Translation	Feb 5, 2025	DecoderSimultaneous Speech-to-Speech Translation	CodeCode Available	5	5
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation	Aug 22, 2023	Automatic Speech RecognitionMachine Translation	CodeCode Available	2	5
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation	Jun 11, 2024	DecoderSimultaneous Speech-to-Speech Translation	CodeCode Available	2	5
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric	Dec 16, 2022	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	CodeCode Available	2	5
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation	Jan 11, 2022	SentenceSpeech-to-Speech Translation	CodeCode Available	2	5
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	Feb 10, 2024	Machine TranslationSpeech-to-Speech Translation	CodeCode Available	2	5
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	May 28, 2024	Machine Translationspeech-recognition	CodeCode Available	2	5
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation	Aug 3, 2023	DecoderQuantization	CodeCode Available	1	5
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation	May 18, 2022	Speech-to-Speech TranslationTranslation	CodeCode Available	1	5
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation	Oct 11, 2023	Decoderfr-en	CodeCode Available	1	5
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation	May 25, 2022	Representation LearningRhythm	CodeCode Available	1	5
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models	Jun 1, 2023	Simultaneous Speech-to-Speech TranslationSpeech-to-Speech Translation	CodeCode Available	1	5
Direct speech-to-speech translation with discrete units	Jul 12, 2021	Speech-to-Speech TranslationText Generation	CodeCode Available	1	5
CTC-based Non-autoregressive Textless Speech-to-Speech Translation	Jun 11, 2024	Knowledge DistillationMachine Translation	CodeCode Available	1	5
Towards Automatic Face-to-Face Translation	Mar 1, 2020	Face to Face TranslationMachine Translation	CodeCode Available	1	5
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models	Dec 21, 2023	ResynthesisSpeech-to-Speech Translation	CodeCode Available	1	5
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech	Jul 17, 2024	Speech-to-Speech Translationtext-to-speech	CodeCode Available	1	5
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation	Dec 5, 2023	Self-Supervised LearningSpeech-to-Speech Translation	CodeCode Available	1	5
A Textless Metric for Speech-to-Speech Comparison	Oct 21, 2022	SentenceSpeech-to-Speech Translation	CodeCode Available	0	5
Dialogs Re-enacted Across Languages	Nov 18, 2022	Speech-to-Speech TranslationTranslation	CodeCode Available	0	5
Textless Speech-to-Speech Translation With Limited Parallel Data	May 24, 2023	Automatic Speech RecognitionDenoising	CodeCode Available	0	5
Towards cross-language prosody transfer for dialog	Jul 9, 2023	Speech-to-Speech TranslationTranslation	CodeCode Available	0	5
Using Phonemes in cascaded S2S translation pipeline	Apr 22, 2025	Simultaneous Speech-to-Speech TranslationSpeech-to-Speech Translation	CodeCode Available	0	5
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation	May 22, 2024	DenoisingNoise Estimation	CodeCode Available	0	5
LibriS2S: A German-English Speech-to-Speech Translation Corpus	Apr 22, 2022	Speech-to-Speech TranslationSpeech-to-Text	CodeCode Available	0	5
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation	May 21, 2025	Language ModelingLanguage Modelling	CodeCode Available	0	5
Pretrained Speech Encoders and Efficient Fine-tuning Methods for Speech Translation: UPC at IWSLT 2022	May 1, 2022	DecoderKnowledge Distillation	CodeCode Available	0	5
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection	Sep 17, 2024	Emotion RecognitionSpeech Emotion Recognition	CodeCode Available	0	5
Direct speech-to-speech translation with a sequence-to-sequence model	Apr 12, 2019	Speech SynthesisSpeech-to-Speech Translation	CodeCode Available	0	5
Direct Speech to Speech Translation: A Review	Mar 3, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0	0
Direct Speech-to-Speech Neural Machine Translation: A Survey	Nov 13, 2024	Machine TranslationSpeech-to-Speech Translation	—Unverified	0	0
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention	Oct 15, 2021	Simultaneous Speech-to-Speech TranslationSpeech Synthesis	—Unverified	0	0
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Feb 24, 2025	Automatic Speech RecognitionLanguage Modeling	—Unverified	0	0
Direct Punjabi to English speech translation using discrete units	Feb 25, 2024	Speech-to-Speech TranslationSpeech-to-Text	—Unverified	0	0
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation	Jun 14, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0	0
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation	May 24, 2023	Speech-to-Speech TranslationTranslation	—Unverified	0	0
From Speech-to-Speech Translation to Automatic Dubbing	Jan 19, 2020	Machine TranslationSpeech-to-Speech Translation	—Unverified	0	0
Findings of the IWSLT 2024 Evaluation Campaign	Nov 7, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0	0
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation	Oct 26, 2023	Image GenerationSpeech-to-Speech Translation	—Unverified	0	0
Assessing Evaluation Metrics for Speech-to-Speech Translation	Oct 26, 2021	Machine TranslationOpen-Ended Question Answering	—Unverified	0	0
A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation	Jan 25, 2023	Speech-to-Speech TranslationTranslation	—Unverified	0	0
Findings of the IWSLT 2022 Evaluation Campaign	May 1, 2022	Speech-to-Speech TranslationTranslation	—Unverified	0	0
Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training	Oct 20, 2020	SentenceSimultaneous Speech-to-Speech Translation	—Unverified	0	0
Phonology-Guided Speech-to-Speech Translation for African Languages	Oct 30, 2024	Semantic SimilaritySemantic Textual Similarity	—Unverified	0	0
Evaluating MT Systems: A Theoretical Framework	Feb 11, 2022	Machine TranslationSpeech-to-Speech Translation	—Unverified	0	0

Show:10 25 50

← PrevPage 1 of 3Next →

All datasets TAT FLEURS X-eng CVSS

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Hokkien→En (Two-pass decoding)	ASR-BLEU (Dev)	13.6	—	Unverified
2	Hokkien→En (Two-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
3	Hokkien→En (Three-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
4	Hokkien→En (Single-pass decoding)	ASR-BLEU (Dev)	8.8	—	Unverified
5	En→Hokkien (Two-pass decoding)	ASR-BLEU (Dev)	7.8	—	Unverified
6	En→Hokkien (Three-stage)	ASR-BLEU (Dev)	7.5	—	Unverified
7	En→Hokkien (Two-stage)	ASR-BLEU (Dev)	7.1	—	Unverified
8	En→Hokkien (Single-pass decoding)	ASR-BLEU (Dev)	6.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GenTranslateV2	ASR-BLEU	32.3	—	Unverified
2	GenTranslateV1	ASR-BLEU	30.1	—	Unverified
3	SeamlessM4T LargeV2	ASR-BLEU	29.4	—	Unverified
4	SeamlessM4T Large	ASR-BLEU	25.8	—	Unverified
5	AudioPaLM2	ASR-BLEU	24	—	Unverified
6	WhisperV2	ASR-BLEU	23.5	—	Unverified
7	SeamlessM4T Medium	ASR-BLEU	20.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SeamlessM4T Large	ASR-BLEU	36.5	—	Unverified
2	SeamlessM4T Medium	ASR-BLEU	28.1	—	Unverified