| UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jun 4, 2025 | cross-modal alignment, Lipreading | Unverified | 0 |
| CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Jun 3, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Jun 3, 2025 | Expressive Speech Synthesis, Prompt Learning | Unverified | 0 |
| Towards a Japanese Full-duplex Spoken Dialogue System | Jun 3, 2025 | Spoken Dialogue Systems, text-to-speech | Unverified | 0 |
| Zero-Shot Text-to-Speech for Vietnamese | Jun 2, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Jun 2, 2025 | Keyword Spotting, speech-recognition | Unverified | 0 |
| SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | Jun 2, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models | Jun 1, 2025 | counterfactual, Speech Synthesis | Unverified | 0 |
| Chain-of-Thought Training for Open E2E Spoken Dialogue Systems | May 31, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement | May 30, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation | May 30, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Can Emotion Fool Anti-spoofing? | May 29, 2025 | Emotion Recognition, Speech Emotion Recognition | Unverified | 0 |
| LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting | May 29, 2025 | Keyword Spotting, text-to-speech | Unverified | 0 |
| Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes | May 29, 2025 | Audio Deepfake Detection, DeepFake Detection | Code Available | 0 |
| Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | May 27, 2025 | Style Transfer, text-to-speech | Unverified | 0 |
| DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | May 26, 2025 | Attribute, Emotional Speech Synthesis | Unverified | 0 |
| Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | May 26, 2025 | Sentence, Speech Synthesis | Unverified | 0 |
| KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | May 26, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling | May 26, 2025 | GPU, text-to-speech | Unverified | 0 |
| SpeakStream: Streaming Text-to-Speech with Interleaved Data | May 25, 2025 | Decoder, text-to-speech | Unverified | 0 |
| Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis | May 25, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning | May 25, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt | May 24, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations | May 24, 2025 | Expressive Speech Synthesis, Speech Synthesis | Unverified | 0 |
| What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection | May 23, 2025 | Face Swapping, Sensitivity | Unverified | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | Benchmarking, Dialogue Generation | Unverified | 0 |
| MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | May 21, 2025 | Emotion Recognition, Face Detection | Unverified | 0 |
| Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information | May 21, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech | May 21, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | May 20, 2025 | Dataset Generation, Speech Synthesis | Unverified | 0 |
| Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising | May 20, 2025 | Decoder, Denoising | Unverified | 0 |
| Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English | May 20, 2025 | Automatic Speech Recognition, speech-recognition | Unverified | 0 |
| AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | May 20, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement | May 20, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | May 19, 2025 | Attribute, Speech Synthesis | Unverified | 0 |
| Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | May 18, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset | May 16, 2025 | DeepFake Detection, Face Swapping | Code Available | 0 |
| Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | May 16, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | May 15, 2025 | Emotional Speech Synthesis, Language Modeling | Unverified | 0 |
| MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder | May 12, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | May 12, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation | May 10, 2025 | Grapheme-to-Phoneme Conversion, Large Language Model | Unverified | 0 |
| FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech | May 8, 2025 | Style Transfer, text-to-speech | Unverified | 0 |
| Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | May 8, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Generating Narrated Lecture Videos from Slides with Synchronized Highlights | May 5, 2025 | Math, text-to-speech | Unverified | 0 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | Apr 30, 2025 | Arabic Text Diacritization, Benchmarking | Unverified | 0 |
| Towards Flow-Matching-based TTS without Classifier-Free Guidance | Apr 29, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| ClonEval: An Open Voice Cloning Benchmark | Apr 29, 2025 | text-to-speech, Text to Speech | Code Available | 0 |
| A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Apr 22, 2025 | cross-modal alignment, Script Generation | Unverified | 0 |
| EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting | Apr 17, 2025 | text-to-speech, Text to Speech | Unverified | 0 |