
ROBUST SPEECH COMMAND RECOGNITION USING LABEL-DRIVEN TIME-FREQUENCY MASKING

2018-10-22

Anonymous


Abstract

Robust Automatic Speech Recognition (ASR) systems driven by speech enhancement typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, even though they improve speech quality, do not always improve recognition performance. On the other hand, multi-condition training of ASR systems offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with a unified enhancement and sequence classification block that is trained end-to-end using only noisy speech, without any clean-speech information. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, and it is followed by an LSTM sequence classification block. The T-F masking formulation allows the learned mask to be visualized, which shows which T-F points are important for classifying a speech command. Experiments on the Google Speech Command dataset show that the proposed network achieves better results than a baseline model without an enhancement front-end.
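To make the T-F masking idea in the abstract concrete, here is a minimal sketch of the masking step alone, in plain Python. It is not the authors' implementation. In the paper the mask comes from a fully convolutional network and the masked features feed an LSTM classifier; here the mask logits are hand-picked toy values, the sigmoid squashing is an assumption (the abstract only says "masking like operation"), and the names `apply_tf_mask` and `salient_points` are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_tf_mask(spectrogram, mask_logits):
    """Scale every T-F point of a spectrogram by a mask value in (0, 1).

    Both arguments are lists of frames, each frame a list of frequency bins.
    Returns (masked_spectrogram, mask). The sigmoid is an assumed choice.
    """
    if len(spectrogram) != len(mask_logits):
        raise ValueError("spectrogram and mask_logits need the same number of frames")
    mask, masked = [], []
    for spec_frame, logit_frame in zip(spectrogram, mask_logits):
        if len(spec_frame) != len(logit_frame):
            raise ValueError("each frame needs the same number of frequency bins")
        mask_frame = [sigmoid(z) for z in logit_frame]
        mask.append(mask_frame)
        masked.append([s * m for s, m in zip(spec_frame, mask_frame)])
    return masked, mask


def salient_points(mask, k):
    """Return the k (frame, bin) indices with the largest mask values."""
    points = [
        (value, t, f)
        for t, frame in enumerate(mask)
        for f, value in enumerate(frame)
    ]
    points.sort(key=lambda p: p[0], reverse=True)
    return [(t, f) for _, t, f in points[:k]]


# Toy input: 2 time frames x 3 frequency bins of noisy magnitudes.
noisy = [[1.0, 2.0, 4.0],
         [0.5, 3.0, 1.0]]
# Hand-picked logits standing in for the output of an enhancement network.
logits = [[0.0, 10.0, -10.0],
          [5.0, -5.0, 0.0]]

masked, mask = apply_tf_mask(noisy, logits)
top_points = salient_points(mask, 2)
```

Because the mask has the same shape as the input spectrogram, it can be plotted directly as an image over time and frequency, and `salient_points` lists the T-F points the mask passes most strongly.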
