| Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling | Sep 24, 2024 | Articles, text-to-speech | Unverified | 0 |
| Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Sep 24, 2024 | Emotional Speech Synthesis, Speech Synthesis | Unverified | 0 |
| LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation | Sep 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Zero-shot Cross-lingual Voice Transfer for TTS | Sep 20, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| On the Feasibility of Fully AI-automated Vishing Attacks | Sep 20, 2024 | Large Language Model, Speech-to-Text | Unverified | 0 |
| Preference Alignment Improves Language Model-Based TTS | Sep 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space | Sep 19, 2024 | Automatic Speech Recognition, Data Augmentation | Unverified | 0 |
| SpoofCeleb: Speech Deepfake Detection and SASV In The Wild | Sep 18, 2024 | DeepFake Detection, Diversity | Unverified | 0 |
| Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems | Sep 18, 2024 | Sentence, text-to-speech | Unverified | 0 |
| DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech | Sep 18, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference | Sep 18, 2024 | Audio Compression, Language Modeling | Unverified | 0 |
| Moshi: a speech-text foundation model for real-time dialogue | Sep 17, 2024 | Action Detection, Activity Detection | Code Available | 9 |
| The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives | Sep 17, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Sep 17, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| SpMis: An Investigation of Synthetic Spoken Misinformation Detection | Sep 17, 2024 | Misinformation, text-to-speech | Unverified | 0 |
| StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Sep 16, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Sep 16, 2024 | Emotional Speech Synthesis, In-Context Learning | Unverified | 0 |
| Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning | Sep 15, 2024 | Multi-Task Learning, text-to-speech | Unverified | 0 |
| E1 TTS: Simple and Fast Non-Autoregressive TTS | Sep 14, 2024 | Denoising, text-to-speech | Unverified | 0 |
| Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation | Sep 14, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| SafeEar: Content Privacy-Preserving Audio Deepfake Detection | Sep 14, 2024 | Audio Deepfake Detection, DeepFake Detection | Code Available | 2 |
| AccentBox: Towards High-Fidelity Zero-Shot Accent Generation | Sep 13, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Sep 13, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Text-To-Speech Synthesis In The Wild | Sep 13, 2024 | Benchmarking, Speaker Recognition | Unverified | 0 |
| Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Sep 12, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT | Sep 11, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | Sep 11, 2024 | Decoder, Speech Synthesis | Code Available | 2 |
| D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack | Sep 11, 2024 | Adversarial Attack, Audio Synthesis | Unverified | 0 |
| Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment | Sep 11, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach | Sep 10, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| VoiceWukong: Benchmarking Deepfake Voice Detection | Sep 10, 2024 | Benchmarking, Face Swapping | Unverified | 0 |
| What happens to diffusion model likelihood when your model is conditional? | Sep 10, 2024 | domain classification, model | Unverified | 0 |
| AS-Speech: Adaptive Style For Speech Synthesis | Sep 9, 2024 | Rhythm, Speech Synthesis | Unverified | 0 |
| IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS | Sep 9, 2024 | Denoising, Speech Enhancement | Code Available | 2 |
| LAST: Language Model Aware Speech Tokenization | Sep 5, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems | Sep 4, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka | Sep 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| A Framework for Synthetic Audio Conversations Generation using Large Language Models | Sep 2, 2024 | Audio Classification, Audio Tagging | Unverified | 0 |
| A multilingual training strategy for low resource Text to Speech | Sep 2, 2024 | Cross-Lingual Transfer, text-to-speech | Unverified | 0 |
| Sample-Efficient Diffusion for Text-To-Speech Synthesis | Sep 1, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | Sep 1, 2024 | Self-Supervised Learning, text-to-speech | Code Available | 9 |
| SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection | Aug 30, 2024 | Self-Supervised Learning, Speech Synthesis | Unverified | 0 |
| AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge | Aug 30, 2024 | DeepFake Detection, Face Swapping | Unverified | 0 |
| Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | Aug 30, 2024 | Audio Compression, Audio Generation | Code Available | 3 |
| Multi-modal Adversarial Training for Zero-Shot Voice Cloning | Aug 28, 2024 | Decoder, text-to-speech | Unverified | 0 |
| Easy, Interpretable, Effective: openSMILE for voice deepfake detection | Aug 28, 2024 | DeepFake Detection, Face Swapping | Unverified | 0 |
| StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech | Aug 27, 2024 | parameter-efficient fine-tuning, text-to-speech | Code Available | 0 |
| DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance | Aug 26, 2024 | Diversity, text-to-speech | Unverified | 0 |
| SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models | Aug 25, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Positional Description for Numerical Normalization | Aug 22, 2024 | speech-recognition, Speech Recognition | Unverified | 0 |