| InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | Jun 21, 2024 | speech-recognition, Speech Recognition | Unverified | 0 |
| DASB -- Discrete Audio and Speech Benchmark | Jun 20, 2024 | Benchmarking, Emotion Recognition | Unverified | 0 |
| Instruction Data Generation and Unsupervised Adaptation for Speech Language Models | Jun 18, 2024 | Synthetic Data Generation, text-to-speech | Unverified | 0 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | Jun 17, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis | Jun 16, 2024 | Disentanglement, Speech Synthesis | Unverified | 0 |
| Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice | Jun 14, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage | Jun 13, 2024 | Sentence, text-to-speech | Unverified | 0 |
| DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing | Jun 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data | Jun 12, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech | Jun 12, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning | Jun 12, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | Jun 12, 2024 | Quantization, Speech Synthesis | Unverified | 0 |
| EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech | Jun 12, 2024 | Emotional Speech Synthesis, text-to-speech | Code Available | 2 |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Jun 11, 2024 | Contrastive Learning, Speech Synthesis | Unverified | 0 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 |
| Controlling Emotion in Text-to-Speech with Natural Language Prompts | Jun 10, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Meta Learning Text-to-Speech Synthesis in over 7000 Languages | Jun 10, 2024 | Meta-Learning, Speech Synthesis | Unverified | 0 |
| MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance | Jun 10, 2024 | Singing Voice Synthesis, text-to-speech | Unverified | 0 |
| Text-aware and Context-aware Expressive Audiobook Speech Synthesis | Jun 9, 2024 | Contrastive Learning, Language Modeling | Unverified | 0 |
| WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark | Jun 9, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS | Jun 9, 2024 | Denoising, Speech Denoising | Unverified | 0 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | Jun 8, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Autoregressive Diffusion Transformer for Text-to-Speech Synthesis | Jun 8, 2024 | Audio Generation, Decoder | Unverified | 0 |
| Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study | Jun 7, 2024 | Diversity, Language Modeling | Unverified | 0 |
| Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs | Jun 7, 2024 | Quantization, Speech Synthesis | Unverified | 0 |
| XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model | Jun 7, 2024 | text-to-speech, Text to Speech | Code Available | 1 |
| A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer | Jun 6, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis | Jun 6, 2024 | Decoder, Inductive Bias | Code Available | 2 |
| Total-Duration-Aware Duration Modeling for Text-to-Speech Systems | Jun 6, 2024 | Diversity, text-to-speech | Unverified | 0 |
| Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model | Jun 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Harder or Different? Understanding Generalization of Audio Deepfake Detection | Jun 5, 2024 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| Style Mixture of Experts for Expressive Text-To-Speech Synthesis | Jun 5, 2024 | Mixture-of-Experts, Speech Synthesis | Unverified | 0 |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Jun 5, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing | Jun 4, 2024 | Decoder, Language Modeling | Unverified | 0 |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models | Jun 4, 2024 | In-Context Learning, Language Modelling | Code Available | 7 |
| BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation | Jun 4, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis | Jun 4, 2024 | In-Context Learning, Language Modeling | Unverified | 0 |
| ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control | Jun 3, 2024 | Speech Synthesis, text-to-speech | Code Available | 3 |
| Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training | Jun 3, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | Jun 2, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | May 29, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | May 28, 2024 | Machine Translation, speech-recognition | Code Available | 2 |
| Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | May 24, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning | May 23, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models | May 23, 2024 | Image Generation, reinforcement-learning | Unverified | 0 |
| Multi-speaker Text-to-speech Training with Speaker Anonymized Data | May 20, 2024 | Speaker anonymization, text-to-speech | Unverified | 0 |
| VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications | May 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Exploring speech style spaces with language models: Emotional TTS without emotion labels | May 18, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model | May 16, 2024 | Hallucination, Language Modeling | Unverified | 0 |
| Building a Luganda Text-to-Speech Model From Crowdsourced Data | May 16, 2024 | Speech Enhancement, text-to-speech | Unverified | 0 |