| Continuous Speech Synthesis using per-token Latent Diffusion | Oct 21, 2024 | Image Generation, Quantization | Unverified | 0 |
| Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech | Oct 18, 2024 | object-detection, Object Detection | Code Available | 0 |
| A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Oct 18, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Enhancing Crowdsourced Audio for Text-to-Speech Models | Oct 17, 2024 | Denoising, text-to-speech | Unverified | 0 |
| DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Oct 17, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Oct 17, 2024 | Disentanglement, Quantization | Unverified | 0 |
| Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Oct 17, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs | Oct 16, 2024 | Diversity, Online Clustering | Unverified | 0 |
| IsoChronoMeter: A simple and effective isochronic translation evaluation metric | Oct 14, 2024 | Machine Translation, text-to-speech | Code Available | 0 |
| DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis | Oct 14, 2024 | Denoising, Speaker Verification | Unverified | 0 |
| Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling | Oct 12, 2024 | text-to-speech, Text to Speech | Code Available | 0 |
| Unsupervised Data Validation Methods for Efficient Model Training | Oct 10, 2024 | Data Augmentation, model | Unverified | 0 |
| Can DeepFake Speech be Reliably Detected? | Oct 9, 2024 | Face Swapping, Misinformation | Unverified | 0 |
| Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Oct 9, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Oct 9, 2024 | Diversity, Speech Synthesis | Unverified | 0 |
| SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech | Oct 7, 2024 | Computational Efficiency, text-to-speech | Unverified | 0 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Oct 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System | Oct 5, 2024 | Adversarial Purification, Speech Synthesis | Unverified | 0 |
| Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens | Oct 4, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Oct 4, 2024 | Disentanglement, Speech Synthesis | Unverified | 0 |
| Generative Semantic Communication for Text-to-Speech Synthesis | Oct 4, 2024 | Quantization, Semantic Communication | Unverified | 0 |
| Augmentation through Laundering Attacks for Audio Spoof Detection | Oct 1, 2024 | Data Augmentation, Face Swapping | Unverified | 0 |
| Accent conversion using discrete units with parallel data synthesized from controllable accented TTS | Sep 30, 2024 | Data Augmentation, Speech Synthesis | Unverified | 0 |
| Word-wise intonation model for cross-language TTS systems | Sep 30, 2024 | Dynamic Time Warping, Prosody Prediction | Unverified | 0 |
| FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency | Sep 28, 2024 | Text to Speech | Code Available | 0 |
| Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control | Sep 26, 2024 | Self-Supervised Learning, text-to-speech | Unverified | 0 |
| Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Sep 25, 2024 | Style Transfer, text-to-speech | Unverified | 0 |
| Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions | Sep 25, 2024 | Attribute, Dimensionality Reduction | Unverified | 0 |
| StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Sep 24, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling | Sep 24, 2024 | Articles, text-to-speech | Unverified | 0 |
| Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Sep 24, 2024 | Emotional Speech Synthesis, Speech Synthesis | Unverified | 0 |
| On the Feasibility of Fully AI-automated Vishing Attacks | Sep 20, 2024 | Large Language Model, Speech-to-Text | Unverified | 0 |
| Zero-shot Cross-lingual Voice Transfer for TTS | Sep 20, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space | Sep 19, 2024 | Automatic Speech Recognition, Data Augmentation | Unverified | 0 |
| Preference Alignment Improves Language Model-Based TTS | Sep 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference | Sep 18, 2024 | Audio Compression, Language Modeling | Unverified | 0 |
| DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech | Sep 18, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems | Sep 18, 2024 | Sentence, text-to-speech | Unverified | 0 |
| SpoofCeleb: Speech Deepfake Detection and SASV In The Wild | Sep 18, 2024 | DeepFake Detection, Diversity | Unverified | 0 |
| The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives | Sep 17, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| SpMis: An Investigation of Synthetic Spoken Misinformation Detection | Sep 17, 2024 | Misinformation, text-to-speech | Unverified | 0 |
| Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Sep 17, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Sep 16, 2024 | Emotional Speech Synthesis, In-Context Learning | Unverified | 0 |
| StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Sep 16, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning | Sep 15, 2024 | Multi-Task Learning, text-to-speech | Unverified | 0 |
| Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation | Sep 14, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| E1 TTS: Simple and Fast Non-Autoregressive TTS | Sep 14, 2024 | Denoising, text-to-speech | Unverified | 0 |
| Text-To-Speech Synthesis In The Wild | Sep 13, 2024 | Benchmarking, Speaker Recognition | Unverified | 0 |
| AccentBox: Towards High-Fidelity Zero-Shot Accent Generation | Sep 13, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Sep 13, 2024 | text-to-speech, Text to Speech | Unverified | 0 |