| Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting | Aug 20, 2024 | Keyword Spotting, text-to-speech | Unverified | 0 |
| kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech | Aug 20, 2024 | Retrieval, Self-Supervised Learning | Unverified | 0 |
| Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition | Aug 17, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation | Aug 14, 2024 | Speech Synthesis, text-to-speech | Code Available | 3 |
| Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation | Aug 13, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis | Aug 13, 2024 | Speech Synthesis, Spoken Dialogue Systems | Code Available | 0 |
| PRESENT: Zero-Shot Text-to-Prosody Control | Aug 13, 2024 | Prosody Prediction, Speech Synthesis | Code Available | 1 |
| FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks | Aug 12, 2024 | Few-Shot Learning, text-to-speech | Unverified | 0 |
| VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Aug 11, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features | Aug 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation | Aug 1, 2024 | Representation Learning, Speech Synthesis | Unverified | 0 |
| On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition | Jul 31, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | Jul 26, 2024 | Generative Adversarial Network, Speech Enhancement | Unverified | 0 |
| On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures | Jul 25, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model | Jul 24, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments | Jul 23, 2024 | Diversity, Keyword Spotting | Unverified | 0 |
| Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 | Jul 19, 2024 | Audio Generation, Audio Synthesis | Unverified | 0 |
| Handling Numeric Expressions in Automatic Speech Recognition | Jul 18, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | Jul 18, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network | Jul 17, 2024 | text-to-speech, Text to Speech | Code Available | 0 |
| Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech | Jul 17, 2024 | Speech-to-Speech Translation, text-to-speech | Code Available | 1 |
| TTSDS -- Text-to-Speech Distribution Score | Jul 17, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| A Language Modeling Approach to Diacritic-Free Hebrew TTS | Jul 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding | Jul 12, 2024 | regression, text-to-speech | Code Available | 0 |
| Autoregressive Speech Synthesis without Vector Quantization | Jul 11, 2024 | Audio Compression, Diversity | Unverified | 0 |
| Source Tracing of Audio Deepfake Systems | Jul 10, 2024 | Face Swapping, text-to-speech | Unverified | 0 |
| ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation | Jul 7, 2024 | Sentence, text-to-speech | Unverified | 0 |
| Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation | Jul 7, 2024 | Text to Speech | Code Available | 0 |
| CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Jul 7, 2024 | Language Modelling, Large Language Model | Code Available | 11 |
| Optimizing a-DCF for Spoofing-Robust Speaker Verification | Jul 4, 2024 | Speaker Verification, Text to Speech | Unverified | 0 |
| Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis | Jul 4, 2024 | Accented Speech Recognition, Automatic Speech Recognition | Unverified | 0 |
| On the Effectiveness of Acoustic BPE in Decoder-Only TTS | Jul 4, 2024 | Decoder, Diversity | Unverified | 0 |
| CATT: Character-based Arabic Tashkeel Transformer | Jul 3, 2024 | Arabic Text Diacritization, Decoder | Code Available | 2 |
| TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations | Jul 2, 2024 | Benchmarking, text-to-speech | Unverified | 0 |
| Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization | Jul 2, 2024 | Inference Optimization, Speech Synthesis | Unverified | 0 |
| Lightweight Zero-shot Text-to-Speech with Mixture of Adapters | Jul 1, 2024 | Decoder, Speech Synthesis | Unverified | 0 |
| FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis | Jun 30, 2024 | CPU, Decoder | Unverified | 0 |
| NAIST Simultaneous Speech Translation System for IWSLT 2024 | Jun 30, 2024 | Speech-to-Speech Translation, Speech-to-Text | Unverified | 0 |
| Open-Source Conversational AI with SpeechBrain 1.0 | Jun 29, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models | Jun 27, 2024 | Speaker Verification, text-to-speech | Unverified | 0 |
| DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability | Jun 27, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS | Jun 26, 2024 | text-to-speech, Text to Speech | Code Available | 1 |
| Automatic Speech Recognition for Hindi | Jun 26, 2024 | Action Detection, Activity Detection | Unverified | 0 |
| LLM-Driven Multimodal Opinion Expression Identification | Jun 26, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model | Jun 25, 2024 | Computational Efficiency, Language Modeling | Unverified | 0 |
| Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation | Jun 25, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment | Jun 25, 2024 | Decoder, Language Modeling | Unverified | 0 |
| Towards Zero-Shot Text-To-Speech for Arabic Dialects | Jun 24, 2024 | Dialect Identification, Speech Synthesis | Unverified | 0 |
| A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge | Jun 22, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers | Jun 22, 2024 | Decoder, Language Modeling | Code Available | 1 |