| EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector | Nov 4, 2024 | Decoder, Emotional Speech Synthesis | Code Available | 2 |
| Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Oct 31, 2024 | Rhythm, speech-recognition | Unverified | 0 |
| Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Oct 30, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech | Oct 29, 2024 | Decoder, text-to-speech | Code Available | 0 |
| Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Oct 29, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis | Oct 29, 2024 | Denoising, Singing Voice Synthesis | Unverified | 0 |
| Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier | Oct 28, 2024 | Audio Deepfake Detection, Audio Generation | Code Available | 2 |
| Asynchronous Tool Usage for Real-Time Agents | Oct 28, 2024 | Automatic Speech Recognition, speech-recognition | Unverified | 0 |
| Mitigating Unauthorized Speech Synthesis for Voice Protection | Oct 28, 2024 | Data Augmentation, Face Swapping | Code Available | 1 |
| Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Oct 27, 2024 | parameter-efficient fine-tuning, Question Answering | Unverified | 0 |
| Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Oct 24, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | Oct 24, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Oct 24, 2024 | Multi-Task Learning, speech-recognition | Code Available | 1 |
| ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Oct 23, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Oct 22, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Continuous Speech Tokenizer in Text To Speech | Oct 22, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Continuous Speech Synthesis using per-token Latent Diffusion | Oct 21, 2024 | Image Generation, Quantization | Unverified | 0 |
| A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Oct 18, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech | Oct 18, 2024 | object-detection, Object Detection | Code Available | 0 |
| DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Oct 17, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Enhancing Crowdsourced Audio for Text-to-Speech Models | Oct 17, 2024 | Denoising, text-to-speech | Unverified | 0 |
| Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Oct 17, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Oct 17, 2024 | Disentanglement, Quantization | Unverified | 0 |
| ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs | Oct 16, 2024 | Diversity, Online Clustering | Unverified | 0 |
| IsoChronoMeter: A simple and effective isochronic translation evaluation metric | Oct 14, 2024 | Machine Translation, text-to-speech | Code Available | 0 |
| DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis | Oct 14, 2024 | Denoising, Speaker Verification | Unverified | 0 |
| Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling | Oct 12, 2024 | text-to-speech, Text to Speech | Code Available | 0 |
| Unsupervised Data Validation Methods for Efficient Model Training | Oct 10, 2024 | Data Augmentation, model | Unverified | 0 |
| Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Oct 9, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | Oct 9, 2024 | Denoising, text-to-speech | Code Available | 11 |
| Can DeepFake Speech be Reliably Detected? | Oct 9, 2024 | Face Swapping, Misinformation | Unverified | 0 |
| Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Oct 9, 2024 | Diversity, Speech Synthesis | Unverified | 0 |
| SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech | Oct 7, 2024 | Computational Efficiency, text-to-speech | Unverified | 0 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Oct 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Where are we in audio deepfake detection? A systematic analysis over generative and detection models | Oct 6, 2024 | Audio Deepfake Detection, Audio Synthesis | Code Available | 1 |
| Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System | Oct 5, 2024 | Adversarial Purification, Speech Synthesis | Unverified | 0 |
| Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens | Oct 4, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Generative Semantic Communication for Text-to-Speech Synthesis | Oct 4, 2024 | Quantization, Semantic Communication | Unverified | 0 |
| MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Oct 4, 2024 | Disentanglement, Speech Synthesis | Unverified | 0 |
| Recent Advances in Speech Language Models: A Survey | Oct 1, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 2 |
| EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Oct 1, 2024 | Emotional Speech Synthesis, Speech Synthesis | Code Available | 2 |
| Augmentation through Laundering Attacks for Audio Spoof Detection | Oct 1, 2024 | Data Augmentation, Face Swapping | Unverified | 0 |
| Accent conversion using discrete units with parallel data synthesized from controllable accented TTS | Sep 30, 2024 | Data Augmentation, Speech Synthesis | Unverified | 0 |
| Word-wise intonation model for cross-language TTS systems | Sep 30, 2024 | Dynamic Time Warping, Prosody Prediction | Unverified | 0 |
| FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency | Sep 28, 2024 | Text to Speech | Code Available | 0 |
| Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control | Sep 26, 2024 | Self-Supervised Learning, text-to-speech | Unverified | 0 |
| Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Sep 25, 2024 | Style Transfer, text-to-speech | Unverified | 0 |
| Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions | Sep 25, 2024 | Attribute, Dimensionality Reduction | Unverified | 0 |
| Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation | Sep 25, 2024 | text-to-speech, Text to Speech | Code Available | 5 |
| StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Sep 24, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |