| EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge | May 29, 2025 | text-to-speech, Text to Speech | Code Available | 3 |
| LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting | May 29, 2025 | Keyword Spotting, text-to-speech | Unverified | 0 |
| Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | May 27, 2025 | Style Transfer, text-to-speech | Unverified | 0 |
| Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | May 26, 2025 | Sentence, Speech Synthesis | Unverified | 0 |
| KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | May 26, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | May 26, 2025 | Attribute, Emotional Speech Synthesis | Unverified | 0 |
| Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling | May 26, 2025 | GPU, text-to-speech | Unverified | 0 |
| Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment | May 26, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis | May 25, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning | May 25, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| SpeakStream: Streaming Text-to-Speech with Interleaved Data | May 25, 2025 | Decoder, text-to-speech | Unverified | 0 |
| RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations | May 24, 2025 | Expressive Speech Synthesis, Speech Synthesis | Unverified | 0 |
| MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt | May 24, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Speechless: Speech Instruction Training Without Speech for Low Resource Languages | May 23, 2025 | speech-recognition, Speech Recognition | Code Available | 7 |
| What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection | May 23, 2025 | Face Swapping, Sensitivity | Unverified | 0 |
| UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information | May 23, 2025 | Large Language Model, Quantization | Code Available | 1 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | Benchmarking, Dialogue Generation | Unverified | 0 |
| From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | May 22, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech | May 21, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information | May 21, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | May 21, 2025 | Emotion Recognition, Face Detection | Unverified | 0 |
| Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | May 21, 2025 | Bayesian Optimization, Speech Synthesis | Code Available | 1 |
| Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English | May 20, 2025 | Automatic Speech Recognition, speech-recognition | Unverified | 0 |
| AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | May 20, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement | May 20, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising | May 20, 2025 | Decoder, Denoising | Unverified | 0 |
| FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | May 20, 2025 | Dataset Generation, Speech Synthesis | Unverified | 0 |
| OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | May 19, 2025 | Attribute, Speech Synthesis | Unverified | 0 |
| Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis | May 18, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | May 16, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset | May 16, 2025 | DeepFake Detection, Face Swapping | Code Available | 0 |
| UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | May 15, 2025 | Emotional Speech Synthesis, Language Modeling | Unverified | 0 |
| MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder | May 12, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | May 12, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation | May 10, 2025 | Grapheme-to-Phoneme Conversion, Large Language Model | Unverified | 0 |
| FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech | May 8, 2025 | Style Transfer, text-to-speech | Unverified | 0 |
| Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | May 8, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | May 6, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 4 |
| Generating Narrated Lecture Videos from Slides with Synchronized Highlights | May 5, 2025 | Math, text-to-speech | Unverified | 0 |
| Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | May 5, 2025 | AI Agent, Automatic Speech Recognition | Code Available | 3 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | Apr 30, 2025 | Arabic Text Diacritization, Benchmarking | Unverified | 0 |
| Towards Flow-Matching-based TTS without Classifier-Free Guidance | Apr 29, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| ClonEval: An Open Voice Cloning Benchmark | Apr 29, 2025 | text-to-speech, Text to Speech | Code Available | 0 |
| A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Apr 22, 2025 | cross-modal alignment, Script Generation | Unverified | 0 |
| EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting | Apr 17, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM | Apr 15, 2025 | Quantization, Reading Comprehension | Unverified | 0 |
| Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Apr 14, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Apr 14, 2025 | RAG, Retrieval-augmented Generation | Unverified | 0 |
| Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation | Apr 11, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Apr 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |