Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists on translating speech from one language to speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. Recently, works on S2ST without relying on intermediate text representation is emerging.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 117 papers

Title	Date	Tasks	Status	Hype
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs	Jun 12, 2025	Speech-to-Speech Translationtext-to-speech	—Unverified	0
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation	Jun 11, 2025	Reading ComprehensionSpeech Synthesis	—Unverified	0
Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation	Jun 4, 2025	Language ModelingLanguage Modelling	—Unverified	0
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing	May 27, 2025	Speech-to-Speech TranslationTranslation	—Unverified	0
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation	May 21, 2025	Language ModelingLanguage Modelling	CodeCode Available	0
Language translation, and change of accent for speech-to-speech task using diffusion model	May 4, 2025	Speech-to-Speech TranslationTranslation	—Unverified	0
Using Phonemes in cascaded S2S translation pipeline	Apr 22, 2025	Simultaneous Speech-to-Speech TranslationSpeech-to-Speech Translation	CodeCode Available	0
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation	Apr 22, 2025	Simultaneous Speech-to-Speech TranslationSpeech-to-Speech Translation	—Unverified	0
Direct Speech to Speech Translation: A Review	Mar 3, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus	Feb 25, 2025	Speech-to-Speech TranslationTranslation	—Unverified	0
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Feb 24, 2025	Automatic Speech RecognitionLanguage Modeling	—Unverified	0
Speech to Speech Translation with Translatotron: A State of the Art Review	Feb 9, 2025	speech-recognitionSpeech Recognition	—Unverified	0
High-Fidelity Simultaneous Speech-To-Speech Translation	Feb 5, 2025	DecoderSimultaneous Speech-to-Speech Translation	CodeCode Available	5
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation	Feb 1, 2025	Speech-to-Speech TranslationTranslation	—Unverified	0
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation	Dec 21, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0
Direct Speech-to-Speech Neural Machine Translation: A Survey	Nov 13, 2024	Machine TranslationSpeech-to-Speech Translation	—Unverified	0
Findings of the IWSLT 2024 Evaluation Campaign	Nov 7, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0
Phonology-Guided Speech-to-Speech Translation for African Languages	Oct 30, 2024	Semantic SimilaritySemantic Textual Similarity	—Unverified	0
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens	Oct 4, 2024	Language ModelingLanguage Modelling	—Unverified	0
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection	Sep 17, 2024	Emotion RecognitionSpeech Emotion Recognition	CodeCode Available	0
What does it take to get state of the art in simultaneous speech-to-speech translation?	Sep 2, 2024	HallucinationManagement	—Unverified	0
PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese	Jul 19, 2024	Singing Voice SynthesisSpeech-to-Speech Translation	—Unverified	0
Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems	Jul 18, 2024	Speech-to-Speech TranslationVoice Cloning	—Unverified	0
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech	Jul 17, 2024	Speech-to-Speech Translationtext-to-speech	CodeCode Available	1
Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation	Jul 8, 2024	Automatic Speech RecognitionEmotion Recognition	—Unverified	0
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs	Jul 4, 2024	Emotion RecognitionEvent Detection	CodeCode Available	11
NAIST Simultaneous Speech Translation System for IWSLT 2024	Jun 30, 2024	Speech-to-Speech TranslationSpeech-to-Text	—Unverified	0
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation	Jun 14, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0
CTC-based Non-autoregressive Textless Speech-to-Speech Translation	Jun 11, 2024	Knowledge DistillationMachine Translation	CodeCode Available	1
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation	Jun 11, 2024	DecoderSimultaneous Speech-to-Speech Translation	CodeCode Available	2
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?	Jun 11, 2024	Contrastive LearningSpeech Synthesis	—Unverified	0
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning	Jun 5, 2024	Automatic Speech Recognition (ASR)de-en	CodeCode Available	5
Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing	Jun 4, 2024	DecoderLanguage Modeling	—Unverified	0
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation	Jun 4, 2024	Speech-to-Speech TranslationTranslation	—Unverified	0
SimulTron: On-Device Simultaneous Speech to Speech Translation	Jun 4, 2024	Simultaneous Speech-to-Speech TranslationSpeech-to-Speech Translation	—Unverified	0
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought	May 30, 2024	Language ModelingLanguage Modelling	—Unverified	0
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	May 28, 2024	Machine Translationspeech-recognition	CodeCode Available	2
CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning	May 23, 2024	es-enfr-en	—Unverified	0
DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation	May 22, 2024	DenoisingNoise Estimation	CodeCode Available	0
MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation	Mar 19, 2024	DecoderLanguage Modeling	—Unverified	0
Direct Punjabi to English speech translation using discrete units	Feb 25, 2024	Speech-to-Speech TranslationSpeech-to-Text	—Unverified	0
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	Feb 10, 2024	Machine TranslationSpeech-to-Speech Translation	CodeCode Available	2
A Case Study on Filtering for End-to-End Speech Translation	Feb 2, 2024	Speech-to-Speech TranslationSpeech-to-Text	—Unverified	0
TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data	Jan 17, 2024	SentenceSpeech-to-Speech Translation	—Unverified	0
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation	Dec 23, 2023	es-enfr-en	—Unverified	0
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models	Dec 21, 2023	ResynthesisSpeech-to-Speech Translation	CodeCode Available	1
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation	Dec 5, 2023	Self-Supervised LearningSpeech-to-Speech Translation	CodeCode Available	1
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation	Oct 26, 2023	Image GenerationSpeech-to-Speech Translation	—Unverified	0
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation	Oct 11, 2023	Decoderfr-en	CodeCode Available	1
Enhancing expressivity transfer in textless speech-to-speech translation	Oct 11, 2023	Self-Supervised LearningSpeech-to-Speech Translation	—Unverified	0

Show:10 25 50

← PrevPage 1 of 3Next →

All datasets TAT FLEURS X-eng CVSS

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Hokkien→En (Two-pass decoding)	ASR-BLEU (Dev)	13.6	—	Unverified
2	Hokkien→En (Two-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
3	Hokkien→En (Three-stage)	ASR-BLEU (Dev)	12.5	—	Unverified
4	Hokkien→En (Single-pass decoding)	ASR-BLEU (Dev)	8.8	—	Unverified
5	En→Hokkien (Two-pass decoding)	ASR-BLEU (Dev)	7.8	—	Unverified
6	En→Hokkien (Three-stage)	ASR-BLEU (Dev)	7.5	—	Unverified
7	En→Hokkien (Two-stage)	ASR-BLEU (Dev)	7.1	—	Unverified
8	En→Hokkien (Single-pass decoding)	ASR-BLEU (Dev)	6.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GenTranslateV2	ASR-BLEU	32.3	—	Unverified
2	GenTranslateV1	ASR-BLEU	30.1	—	Unverified
3	SeamlessM4T LargeV2	ASR-BLEU	29.4	—	Unverified
4	SeamlessM4T Large	ASR-BLEU	25.8	—	Unverified
5	AudioPaLM2	ASR-BLEU	24	—	Unverified
6	WhisperV2	ASR-BLEU	23.5	—	Unverified
7	SeamlessM4T Medium	ASR-BLEU	20.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SeamlessM4T Large	ASR-BLEU	36.5	—	Unverified
2	SeamlessM4T Medium	ASR-BLEU	28.1	—	Unverified