| Hear Your Code Fail, Voice-Assisted Debugging for Python | Jul 20, 2025 | CPU, Medical Diagnosis | Unverified | 0 |
| NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Jul 17, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge | Jul 15, 2025 | Speech Enhancement, text-to-speech | Unverified | 0 |
| An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments | Jul 14, 2025 | Speech-to-Text, text-to-speech | Unverified | 0 |
| ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching | Jul 12, 2025 | Dialogue Generation, text-to-speech | Code Available | 4 |
| Exploiting Leaderboards for Large-Scale Distribution of Malicious Models | Jul 11, 2025 | Model Discovery, Text Generation | Unverified | 0 |
| MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling | Jul 11, 2025 | Audio Synthesis, Language Modelling | Unverified | 0 |
| Differentiable Reward Optimization for LLM based TTS system | Jul 8, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis | Jul 8, 2025 | Data Augmentation, Mixture-of-Experts | Unverified | 0 |
| PresentAgent: Multimodal Agent for Presentation Video Generation | Jul 5, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS | Jun 25, 2025 | Speaker Recognition, text-to-speech | Unverified | 0 |
| TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems | Jun 24, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Jun 20, 2025 | Scheduling, Speech Synthesis | Code Available | 2 |
| LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Jun 20, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Optimizing Multilingual Text-To-Speech with Accents & Emotions | Jun 19, 2025 | Disentanglement, Emotion Recognition | Unverified | 0 |
| Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement | Jun 19, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Jun 19, 2025 | Benchmarking, Descriptive | Code Available | 1 |
| PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction | Jun 18, 2025 | Sentence, text-to-speech | Unverified | 0 |
| EmoNews: A Spoken Dialogue System for Expressive News Conversations | Jun 16, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching | Jun 16, 2025 | Decoder, Speech Synthesis | Code Available | 4 |
| Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech | Jun 14, 2025 | Grapheme-to-Phoneme Conversion, text-to-speech | Unverified | 0 |
| StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling | Jun 14, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs | Jun 12, 2025 | Speech-to-Speech Translation, text-to-speech | Unverified | 0 |
| S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation | Jun 11, 2025 | Reading Comprehension, Speech Synthesis | Unverified | 0 |
| Ming-Omni: A Unified Multimodal Model for Perception and Generation | Jun 11, 2025 | Image Generation, text-to-speech | Code Available | 4 |
| UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching | Jun 11, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions | Jun 10, 2025 | text-to-speech, Text to Speech | Code Available | 1 |
| A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data | Jun 10, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Jun 9, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Seeing Voices: Generating A-Roll Video from Audio with Mirage | Jun 9, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Voice Impression Control in Zero-Shot TTS | Jun 6, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Intelligibility of Text-to-Speech Systems for Mathematical Expressions | Jun 5, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning | Jun 5, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset | Jun 4, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Can we reconstruct a dysarthric voice with the large speech model Parler TTS? | Jun 4, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Jun 4, 2025 | Data Augmentation, Diversity | Unverified | 0 |
| BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing | Jun 4, 2025 | Quantization, text-to-speech | Unverified | 0 |
| UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jun 4, 2025 | cross-modal alignment, Lipreading | Unverified | 0 |
| Towards a Japanese Full-duplex Spoken Dialogue System | Jun 3, 2025 | Spoken Dialogue Systems, text-to-speech | Unverified | 0 |
| CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Jun 3, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Jun 3, 2025 | Expressive Speech Synthesis, Prompt Learning | Unverified | 0 |
| Zero-Shot Text-to-Speech for Vietnamese | Jun 2, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction | Jun 2, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Jun 2, 2025 | Keyword Spotting, speech-recognition | Unverified | 0 |
| Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models | Jun 1, 2025 | counterfactual, Speech Synthesis | Unverified | 0 |
| Chain-of-Thought Training for Open E2E Spoken Dialogue Systems | May 31, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement | May 30, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation | May 30, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Can Emotion Fool Anti-spoofing? | May 29, 2025 | Emotion Recognition, Speech Emotion Recognition | Unverified | 0 |
| LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting | May 29, 2025 | Keyword Spotting, text-to-speech | Unverified | 0 |