Deep Speech Synthesis from Articulatory Features

2022-01-16 · ACL ARR January 2022

Anonymous

Abstract

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Recent work has highlighted the potential for deep learning models to perform articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is both computationally efficient and highly intelligible, achieving a transcription word error rate (WER) of 7.14% for the EMA-to-speech task. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach.
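
For readers unfamiliar with the intelligibility metric quoted above, the following is a minimal sketch of how a transcription WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer's transcript of the synthesized audio, divided by the number of reference words. The abstract does not say which recognizer or text normalization was used, so the function name `word_error_rate` and the lowercase-and-split normalization here are illustrative assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization (lowercasing, whitespace tokenization) is an assumption
    made for this sketch; real evaluations often normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 7.14% corresponds to roughly one word error per fourteen reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.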

Tasks

Speech Synthesis
