Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, an approach that is text-centric. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
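
The cascade described above can be sketched as three composed stages. This is only a minimal illustration, not the implementation of any system listed on this page: the `CascadeS2ST` class, the `Audio` type alias, and the `toy_*` functions are hypothetical stand-ins, and a real system would plug actual ASR, MT, and TTS models into the same slots.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption for illustration: a waveform is a plain list of float samples.
Audio = List[float]


@dataclass
class CascadeS2ST:
    """Text-centric S2ST: source speech -> source text -> target text -> target speech."""
    asr: Callable[[Audio], str]   # automatic speech recognition
    mt: Callable[[str], str]      # text-to-text machine translation
    tts: Callable[[str], Audio]   # text-to-speech synthesis

    def translate(self, source_audio: Audio) -> Audio:
        source_text = self.asr(source_audio)   # stage 1: recognize
        target_text = self.mt(source_text)     # stage 2: translate
        return self.tts(target_text)           # stage 3: synthesize


# Hypothetical toy stages so the sketch runs end to end without any models.
def toy_asr(audio: Audio) -> str:
    # Pretend every non-empty utterance is the Spanish word "hola".
    return "hola" if audio else ""


def toy_mt(text: str) -> str:
    # A one-entry lookup table stands in for a translation model.
    return {"hola": "hello"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Encode each character as one fake "sample" instead of real audio.
    return [float(ord(ch)) for ch in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
output = pipeline.translate([0.1, 0.2])
print(output)
```

Because every stage communicates through text, an error made by the ASR stage is passed on to MT and TTS, and languages without a written form cannot be handled this way. A direct S2ST system would replace the three callables with a single speech-to-speech model.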

Papers

Showing 51–100 of 117 papers

Title	Date	Tasks	Status
NAIST Simultaneous Speech Translation System for IWSLT 2024	Jun 30, 2024	Speech-to-Speech Translation, Speech-to-Text	Unverified
Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling	Sep 30, 2022	Language Modeling, Language Modelling	Unverified
Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation	Jun 4, 2025	Language Modeling, Language Modelling	Unverified
PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese	Jul 19, 2024	Singing Voice Synthesis, Speech-to-Speech Translation	Unverified
PolyVoice: Language Models for Speech to Speech Translation	Jun 5, 2023	Language Modeling, Language Modelling	Unverified
Portable Speech-to-Speech Translation on an Android Smartphone: The MFLTS System	Mar 1, 2018	Speech Recognition, Speech-to-Speech Translation	Unverified
Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems	Jul 18, 2024	Speech-to-Speech Translation, Voice Cloning	Unverified
What does it take to get state of the art in simultaneous speech-to-speech translation?	Sep 2, 2024	Hallucination, Management	Unverified
A Case Study on Filtering for End-to-End Speech Translation	Feb 2, 2024	Speech-to-Speech Translation, Speech-to-Text	Unverified
A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation	Jan 25, 2023	Speech-to-Speech Translation, Translation	Unverified
Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation	Jul 8, 2024	Automatic Speech Recognition, Emotion Recognition	Unverified
Assessing Evaluation Metrics for Speech-to-Speech Translation	Oct 26, 2021	Machine Translation, Open-Ended Question Answering	Unverified
AudioPaLM: A Large Language Model That Can Speak and Listen	Jun 22, 2023	Language Modeling, Language Modelling	Unverified
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation	Feb 1, 2025	Speech-to-Speech Translation, Translation	Unverified
Automatic Extraction of Parallel Speech Corpora from Dubbed Movies	Aug 1, 2017	Speech-to-Speech Translation, Translation	Unverified
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation	May 24, 2023	Speech-to-Speech Translation, Translation	Unverified
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Feb 24, 2025	Automatic Speech Recognition, Language Modeling	Unverified
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Jun 11, 2024	Contrastive Learning, Speech Synthesis	Unverified
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus	Feb 25, 2025	Speech-to-Speech Translation, Translation	Unverified
Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis	Nov 4, 2020	Machine Translation, speech-recognition	Unverified
CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning	May 23, 2024	es-en, fr-en	Unverified
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation	Oct 26, 2023	Image Generation, Speech-to-Speech Translation	Unverified
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation	Jun 14, 2024	Speech-to-Speech Translation, Translation	Unverified
Direct Punjabi to English speech translation using discrete units	Feb 25, 2024	Speech-to-Speech Translation, Speech-to-Text	Unverified
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention	Oct 15, 2021	Simultaneous Speech-to-Speech Translation, Speech Synthesis	Unverified
Direct Speech-to-Speech Neural Machine Translation: A Survey	Nov 13, 2024	Machine Translation, Speech-to-Speech Translation	Unverified
Direct Speech to Speech Translation: A Review	Mar 3, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features	Dec 12, 2022	Speech-to-Speech Translation, Translation	Unverified
Direct Text to Speech Translation System using Acoustic Units	Sep 14, 2023	Decoder, Speech-to-Speech Translation	Unverified
Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing	Jun 4, 2024	Decoder, Language Modeling	Unverified
Textless Speech-to-Speech Translation on Real Data	Dec 15, 2021	Speech-to-Speech Translation, Translation	Unverified
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Oct 4, 2024	Language Modeling, Language Modelling	Unverified
The HW-TSC’s Speech to Speech Translation System for IWSLT 2022 Evaluation	May 1, 2022	Machine Translation, Reranking	Unverified
Towards Multilingual Conversations in the Medical Domain: Development of Multilingual Medical Data and A Network-based ASR System	May 1, 2014	Machine Translation, speech-recognition	Unverified
TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data	Jan 17, 2024	Sentence, Speech-to-Speech Translation	Unverified
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation	Dec 23, 2023	es-en, fr-en	Unverified
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation	Jul 19, 2021	Data Augmentation, Decoder	Unverified
Translatotron 3: Speech to Speech Translation with Monolingual Data	May 27, 2023	Speech-to-Speech Translation, Translation	Unverified
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units	Dec 15, 2022	Decoder, Denoising	Unverified
UWSpeech: Speech to Speech Translation for Unwritten Languages	Jun 14, 2020	speech-recognition, Speech Recognition	Unverified
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs	Jun 12, 2025	Speech-to-Speech Translation, text-to-speech	Unverified
Prosodic Alignment for off-screen automatic dubbing	Apr 6, 2022	Speech-to-Speech Translation, Translation	Unverified
Real-time Incremental Speech-to-Speech Translation of Dialogs	Jun 1, 2012	Machine Translation, Speech Recognition	Unverified
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading Comprehension, Speech Synthesis	Unverified
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation	May 17, 2022	Representation Learning, Retrieval	Unverified
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought	May 30, 2024	Language Modeling, Language Modelling	Unverified
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation	Apr 22, 2025	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Unverified
Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS	Nov 10, 2020	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
SimulTron: On-Device Simultaneous Speech to Speech Translation	Jun 4, 2024	Simultaneous Speech-to-Speech Translation, Speech-to-Speech Translation	Unverified
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations	Nov 8, 2022	Mixture-of-Experts, Speech-to-Speech Translation	Unverified

Datasets: TAT, FLEURS X-eng, CVSS

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Hokkien→En (Two-pass decoding)	ASR-BLEU (Dev)	13.6	—	Unverified
2	Hokkien→En (Two-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
3	Hokkien→En (Three-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
4	Hokkien→En (Single-pass decoding)	ASR-BLEU (Dev)	8.8	—	Unverified
5	En→Hokkien (Two-pass decoding)	ASR-BLEU (Dev)	7.8	—	Unverified
6	En→Hokkien (Three-stage)	ASR-BLEU (Dev)	7.5	—	Unverified
7	En→Hokkien (Two-stage)	ASR-BLEU (Dev)	7.1	—	Unverified
8	En→Hokkien (Single-pass decoding)	ASR-BLEU (Dev)	6.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GenTranslateV2	ASR-BLEU	32.3	—	Unverified
2	GenTranslateV1	ASR-BLEU	30.1	—	Unverified
3	SeamlessM4T LargeV2	ASR-BLEU	29.4	—	Unverified
4	SeamlessM4T Large	ASR-BLEU	25.8	—	Unverified
5	AudioPaLM2	ASR-BLEU	24	—	Unverified
6	WhisperV2	ASR-BLEU	23.5	—	Unverified
7	SeamlessM4T Medium	ASR-BLEU	20.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SeamlessM4T Large	ASR-BLEU	36.5	—	Unverified
2	SeamlessM4T Medium	ASR-BLEU	28.1	—	Unverified