Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, an approach that is text-centric. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
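The cascaded approach can be sketched as three stages chained through text. This is a minimal illustration only: the `CascadeS2ST` class, the `Audio` type alias, and the toy stage functions are hypothetical stand-ins invented for this example, not the API of any real ASR, MT, or TTS system.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: audio is represented as a plain list of samples for illustration.
Audio = List[float]


@dataclass
class CascadeS2ST:
    """Hypothetical text-centric cascade: ASR -> MT -> TTS."""
    asr: Callable[[Audio], str]   # source speech -> source text
    mt: Callable[[str], str]      # source text -> target text
    tts: Callable[[str], Audio]   # target text -> target speech

    def translate(self, source_audio: Audio) -> Audio:
        source_text = self.asr(source_audio)
        target_text = self.mt(source_text)
        return self.tts(target_text)


# Toy stand-ins so the sketch runs end to end; real systems would plug in models.
def toy_asr(audio: Audio) -> str:
    return "hello"


def toy_mt(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Encode each character as one "sample" to keep the output checkable.
    return [float(ord(ch)) for ch in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
target_audio = pipeline.translate([0.0, 0.1, -0.1])
```

Every stage communicates through text, which is why the cascade is called text-centric. A direct S2ST system would instead replace the three callables with a single speech-to-speech model.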

Papers

Showing 1–50 of 117 papers

Title	Date	Tasks	Status	Hype
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs	Jul 4, 2024	Emotion Recognition, Event Detection	Code Available	11
Robust Speech Recognition via Large-Scale Weak Supervision	Dec 6, 2022	Robust Speech Recognition, speech-recognition	Code Available	8
AudioLM: a Language Modeling Approach to Audio Generation	Sep 7, 2022	Audio Generation	Code Available	7
High-Fidelity Simultaneous Speech-To-Speech Translation	Feb 5, 2025	Decoder, Simultaneous Speech-to-Speech Translation	Code Available	5
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Jun 5, 2024	Automatic Speech Recognition (ASR), de-en	Code Available	5
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	Mar 7, 2023	In-Context Learning, Language Modeling	Code Available	5
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation	Jun 11, 2024	Decoder, Simultaneous Speech-to-Speech Translation	Code Available	2
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	May 28, 2024	Machine Translation, speech-recognition	Code Available	2
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	Feb 10, 2024	Machine Translation, Speech-to-Speech Translation	Code Available	2
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation	Aug 22, 2023	Automatic Speech Recognition, Machine Translation	Code Available	2
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric	Dec 16, 2022	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	2
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation	Jan 11, 2022	Sentence, Speech-to-Speech Translation	Code Available	2
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech	Jul 17, 2024	Speech-to-Speech Translation, text-to-speech	Code Available	1
CTC-based Non-autoregressive Textless Speech-to-Speech Translation	Jun 11, 2024	Knowledge Distillation, Machine Translation	Code Available	1
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models	Dec 21, 2023	Resynthesis, Speech-to-Speech Translation	Code Available	1
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation	Dec 5, 2023	Self-Supervised Learning, Speech-to-Speech Translation	Code Available	1
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation	Oct 11, 2023	Decoder, fr-en	Code Available	1
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation	Aug 3, 2023	Decoder, Quantization	Code Available	1
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models	Jun 1, 2023	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Code Available	1
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation	May 25, 2022	Representation Learning, Rhythm	Code Available	1
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation	May 18, 2022	Speech-to-Speech Translation, Translation	Code Available	1
Direct speech-to-speech translation with discrete units	Jul 12, 2021	Speech-to-Speech Translation, Text Generation	Code Available	1
Towards Automatic Face-to-Face Translation	Mar 1, 2020	Face to Face Translation, Machine Translation	Code Available	1
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs	Jun 12, 2025	Speech-to-Speech Translation, text-to-speech	Unverified	0
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading Comprehension, Speech Synthesis	Unverified	0
Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation	Jun 4, 2025	Language Modeling, Language Modelling	Unverified	0
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing	May 27, 2025	Speech-to-Speech Translation, Translation	Unverified	0
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation	May 21, 2025	Language Modeling, Language Modelling	Code Available	0
Language translation, and change of accent for speech-to-speech task using diffusion model	May 4, 2025	Speech-to-Speech Translation, Translation	Unverified	0
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation	Apr 22, 2025	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Unverified	0
Using Phonemes in cascaded S2S translation pipeline	Apr 22, 2025	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Code Available	0
Direct Speech to Speech Translation: A Review	Mar 3, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus	Feb 25, 2025	Speech-to-Speech Translation, Translation	Unverified	0
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Feb 24, 2025	Automatic Speech Recognition, Language Modeling	Unverified	0
Speech to Speech Translation with Translatotron: A State of the Art Review	Feb 9, 2025	speech-recognition, Speech Recognition	Unverified	0
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation	Feb 1, 2025	Speech-to-Speech Translation, Translation	Unverified	0
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation	Dec 21, 2024	Speech-to-Speech Translation, Translation	Unverified	0
Direct Speech-to-Speech Neural Machine Translation: A Survey	Nov 13, 2024	Machine Translation, Speech-to-Speech Translation	Unverified	0
Findings of the IWSLT 2024 Evaluation Campaign	Nov 7, 2024	Speech-to-Speech Translation, Translation	Unverified	0
Phonology-Guided Speech-to-Speech Translation for African Languages	Oct 30, 2024	Semantic Similarity, Semantic Textual Similarity	Unverified	0
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Oct 4, 2024	Language Modeling, Language Modelling	Unverified	0
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection	Sep 17, 2024	Emotion Recognition, Speech Emotion Recognition	Code Available	0
What does it take to get state of the art in simultaneous speech-to-speech translation?	Sep 2, 2024	Hallucination, Management	Unverified	0
PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese	Jul 19, 2024	Singing Voice Synthesis, Speech-to-Speech Translation	Unverified	0
Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems	Jul 18, 2024	Speech-to-Speech Translation, Voice Cloning	Unverified	0
Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation	Jul 8, 2024	Automatic Speech Recognition, Emotion Recognition	Unverified	0
NAIST Simultaneous Speech Translation System for IWSLT 2024	Jun 30, 2024	Speech-to-Speech Translation, Speech-to-Text	Unverified	0
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation	Jun 14, 2024	Speech-to-Speech Translation, Translation	Unverified	0
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Jun 11, 2024	Contrastive Learning, Speech Synthesis	Unverified	0
SimulTron: On-Device Simultaneous Speech to Speech Translation	Jun 4, 2024	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Unverified	0

Datasets: TAT, FLEURS X-eng, CVSS

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Hokkien→En (Two-pass decoding)	ASR-BLEU (Dev)	13.6	—	Unverified
2	Hokkien→En (Two-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
3	Hokkien→En (Three-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
4	Hokkien→En (Single-pass decoding)	ASR-BLEU (Dev)	8.8	—	Unverified
5	En→Hokkien (Two-pass decoding)	ASR-BLEU (Dev)	7.8	—	Unverified
6	En→Hokkien (Three-stage)	ASR-BLEU (Dev)	7.5	—	Unverified
7	En→Hokkien (Two-stage)	ASR-BLEU (Dev)	7.1	—	Unverified
8	En→Hokkien (Single-pass decoding)	ASR-BLEU (Dev)	6.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GenTranslateV2	ASR-BLEU	32.3	—	Unverified
2	GenTranslateV1	ASR-BLEU	30.1	—	Unverified
3	SeamlessM4T LargeV2	ASR-BLEU	29.4	—	Unverified
4	SeamlessM4T Large	ASR-BLEU	25.8	—	Unverified
5	AudioPaLM2	ASR-BLEU	24	—	Unverified
6	WhisperV2	ASR-BLEU	23.5	—	Unverified
7	SeamlessM4T Medium	ASR-BLEU	20.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SeamlessM4T Large	ASR-BLEU	36.5	—	Unverified
2	SeamlessM4T Medium	ASR-BLEU	28.1	—	Unverified