SOTAVerified

Speech-to-Speech Translation

Speech-to-speech translation (S2ST) consists of translating speech in one language into speech in another language. This can be done with a text-centric cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging.
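The cascaded approach described above can be sketched in a few lines of Python. This is only an illustrative outline, not the implementation of any paper listed here: the `CascadeS2ST` class, the `Audio` alias, and the three toy stage functions are hypothetical names introduced for the example, and a real system would plug trained ASR, MT, and TTS models into the same three slots.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption for illustration: audio is modeled as a plain list of float samples.
Audio = list[float]


@dataclass
class CascadeS2ST:
    """Hypothetical text-centric cascade: ASR -> MT -> TTS."""
    asr: Callable[[Audio], str]   # source speech -> source text
    mt: Callable[[str], str]      # source text   -> target text
    tts: Callable[[str], Audio]   # target text   -> target speech

    def translate(self, source_audio: Audio) -> Audio:
        source_text = self.asr(source_audio)   # stage 1: recognize
        target_text = self.mt(source_text)     # stage 2: translate
        return self.tts(target_text)           # stage 3: synthesize


# Toy stand-ins so the sketch runs end to end; none of these are real models.
def toy_asr(audio: Audio) -> str:
    return "hello world"


def toy_mt(text: str) -> str:
    return {"hello world": "hola mundo"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Encode each character as a fake "sample" value.
    return [float(ord(ch)) for ch in text]


pipeline = CascadeS2ST(asr=toy_asr, mt=toy_mt, tts=toy_tts)
target_audio = pipeline.translate([0.0, 0.1, -0.1])
```

Every stage boundary in this sketch is a text string, which is what makes the cascade text-centric. Direct S2ST systems aim to remove those intermediate strings.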

Papers

Showing 1-50 of 117 papers

Title | Status | Hype
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Code | 11
Robust Speech Recognition via Large-Scale Weak Supervision | Code | 8
AudioLM: a Language Modeling Approach to Audio Generation | Code | 7
High-Fidelity Simultaneous Speech-To-Speech Translation | Code | 5
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Code | 5
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | Code | 5
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation | Code | 2
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Code | 2
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | Code | 2
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation | Code | 2
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric | Code | 2
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation | Code | 2
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech | Code | 1
CTC-based Non-autoregressive Textless Speech-to-Speech Translation | Code | 1
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models | Code | 1
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation | Code | 1
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation | Code | 1
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation | Code | 1
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models | Code | 1
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation | Code | 1
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation | Code | 1
Direct speech-to-speech translation with discrete units | Code | 1
Towards Automatic Face-to-Face Translation | Code | 1
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs | - | 0
S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation | - | 0
Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation | - | 0
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing | - | 0
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation | Code | 0
Language translation, and change of accent for speech-to-speech task using diffusion model | - | 0
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation | - | 0
Using Phonemes in cascaded S2S translation pipeline | Code | 0
Direct Speech to Speech Translation: A Review | - | 0
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus | - | 0
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | - | 0
Speech to Speech Translation with Translatotron: A State of the Art Review | - | 0
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation | - | 0
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation | - | 0
Direct Speech-to-Speech Neural Machine Translation: A Survey | - | 0
Findings of the IWSLT 2024 Evaluation Campaign | - | 0
Phonology-Guided Speech-to-Speech Translation for African Languages | - | 0
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens | - | 0
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection | Code | 0
What does it take to get state of the art in simultaneous speech-to-speech translation? | - | 0
PolySinger: Singing-Voice to Singing-Voice Translation from English to Japanese | - | 0
Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems | - | 0
Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation | - | 0
NAIST Simultaneous Speech Translation System for IWSLT 2024 | - | 0
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation | - | 0
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | - | 0
SimulTron: On-Device Simultaneous Speech to Speech Translation | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Hokkien→En (Two-pass decoding) | ASR-BLEU (Dev) | 13.6 | - | Unverified
2 | Hokkien→En (Two-stage) | ASR-BLEU (Dev) | 12.5 | - | Unverified
3 | Hokkien→En (Three-stage) | ASR-BLEU (Dev) | 12.5 | - | Unverified
4 | Hokkien→En (Single-pass decoding) | ASR-BLEU (Dev) | 8.8 | - | Unverified
5 | En→Hokkien (Two-pass decoding) | ASR-BLEU (Dev) | 7.8 | - | Unverified
6 | En→Hokkien (Three-stage) | ASR-BLEU (Dev) | 7.5 | - | Unverified
7 | En→Hokkien (Two-stage) | ASR-BLEU (Dev) | 7.1 | - | Unverified
8 | En→Hokkien (Single-pass decoding) | ASR-BLEU (Dev) | 6.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GenTranslateV2 | ASR-BLEU | 32.3 | - | Unverified
2 | GenTranslateV1 | ASR-BLEU | 30.1 | - | Unverified
3 | SeamlessM4T LargeV2 | ASR-BLEU | 29.4 | - | Unverified
4 | SeamlessM4T Large | ASR-BLEU | 25.8 | - | Unverified
5 | AudioPaLM2 | ASR-BLEU | 24 | - | Unverified
6 | WhisperV2 | ASR-BLEU | 23.5 | - | Unverified
7 | SeamlessM4T Medium | ASR-BLEU | 20.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SeamlessM4T Large | ASR-BLEU | 36.5 | - | Unverified
2 | SeamlessM4T Medium | ASR-BLEU | 28.1 | - | Unverified