| NAIST Simultaneous Speech Translation System for IWSLT 2024 | Jun 30, 2024 | Speech-to-Speech Translation, Speech-to-Text | —Unverified | 0 |
| FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis | Jun 30, 2024 | CPU, Decoder | —Unverified | 0 |
| Open-Source Conversational AI with SpeechBrain 1.0 | Jun 29, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models | Jun 27, 2024 | Speaker Verification, text-to-speech | —Unverified | 0 |
| Automatic Speech Recognition for Hindi | Jun 26, 2024 | Action Detection, Activity Detection | —Unverified | 0 |
| LLM-Driven Multimodal Opinion Expression Identification | Jun 26, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model | Jun 25, 2024 | Computational Efficiency, Language Modeling | —Unverified | 0 |
| Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment | Jun 25, 2024 | Decoder, Language Modeling | —Unverified | 0 |
| Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation | Jun 25, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Towards Zero-Shot Text-To-Speech for Arabic Dialects | Jun 24, 2024 | Dialect Identification, Speech Synthesis | —Unverified | 0 |
| A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge | Jun 22, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | Jun 21, 2024 | speech-recognition, Speech Recognition | —Unverified | 0 |
| DASB -- Discrete Audio and Speech Benchmark | Jun 20, 2024 | Benchmarking, Emotion Recognition | —Unverified | 0 |
| Instruction Data Generation and Unsupervised Adaptation for Speech Language Models | Jun 18, 2024 | Synthetic Data Generation, text-to-speech | —Unverified | 0 |
| Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis | Jun 16, 2024 | Disentanglement, Speech Synthesis | —Unverified | 0 |
| Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice | Jun 14, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing | Jun 13, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage | Jun 13, 2024 | Sentence, text-to-speech | —Unverified | 0 |
| VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | Jun 12, 2024 | Quantization, Speech Synthesis | —Unverified | 0 |
| VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech | Jun 12, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data | Jun 12, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Jun 11, 2024 | Contrastive Learning, Speech Synthesis | —Unverified | 0 |
| Meta Learning Text-to-Speech Synthesis in over 7000 Languages | Jun 10, 2024 | Meta-Learning, Speech Synthesis | —Unverified | 0 |
| MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance | Jun 10, 2024 | Singing Voice Synthesis, text-to-speech | —Unverified | 0 |
| Controlling Emotion in Text-to-Speech with Natural Language Prompts | Jun 10, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Text-aware and Context-aware Expressive Audiobook Speech Synthesis | Jun 9, 2024 | Contrastive Learning, Language Modeling | —Unverified | 0 |
| An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS | Jun 9, 2024 | Denoising, Speech Denoising | —Unverified | 0 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | Jun 8, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Autoregressive Diffusion Transformer for Text-to-Speech Synthesis | Jun 8, 2024 | Audio Generation, Decoder | —Unverified | 0 |
| Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study | Jun 7, 2024 | Diversity, Language Modeling | —Unverified | 0 |
| Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs | Jun 7, 2024 | Quantization, Speech Synthesis | —Unverified | 0 |
| A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer | Jun 6, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model | Jun 6, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Total-Duration-Aware Duration Modeling for Text-to-Speech Systems | Jun 6, 2024 | Diversity, text-to-speech | —Unverified | 0 |
| Harder or Different? Understanding Generalization of Audio Deepfake Detection | Jun 5, 2024 | Audio Deepfake Detection, DeepFake Detection | —Unverified | 0 |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Jun 5, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Style Mixture of Experts for Expressive Text-To-Speech Synthesis | Jun 5, 2024 | Mixture-of-Experts, Speech Synthesis | —Unverified | 0 |
| Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis | Jun 4, 2024 | In-Context Learning, Language Modeling | —Unverified | 0 |
| BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation | Jun 4, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing | Jun 4, 2024 | Decoder, Language Modeling | —Unverified | 0 |
| Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training | Jun 3, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | Jun 2, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | May 29, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | May 24, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models | May 23, 2024 | Image Generation, reinforcement-learning | —Unverified | 0 |
| Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning | May 23, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Multi-speaker Text-to-speech Training with Speaker Anonymized Data | May 20, 2024 | Speaker anonymization, text-to-speech | —Unverified | 0 |
| VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications | May 19, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Exploring speech style spaces with language models: Emotional TTS without emotion labels | May 18, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model | May 16, 2024 | Hallucination, Language Modeling | —Unverified | 0 |