Bridging the Granularity Gap for Acoustic Modeling

2023-05-27

Chen Xu, Yuhao Zhang, Chengbo Jiao, Xiaoqian Liu, Chi Hu, Xin Zeng, Tong Xiao, Anxiang Ma, Huizhen Wang, Jingbo Zhu


Code

github.com/xuchennlp/s2t
Official · In paper · PyTorch · ★ 12

Abstract

While the Transformer has become the de facto standard for speech processing, modeling fine-grained frame-level features remains an open challenge: it is hard to capture long-distance dependencies and to distribute the attention weights. We propose Progressive Down-Sampling (PDS), which gradually compresses the acoustic features into coarser-grained units that carry more complete semantic information, similar to text-level representations. In addition, we develop a representation fusion method to alleviate the information loss that inevitably occurs under high compression. In this way, we compress the acoustic features to 1/32 of their initial length while achieving better or comparable performance on the speech recognition task. As a bonus, this yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.
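
The compression described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes five stages of stride 2 (so that 2^5 = 32), uses plain mean pooling as a stand-in for learned down-sampling layers, and replaces the paper's representation fusion with a simple average of the intermediate stages after pooling them to the final length. All function names (`downsample`, `progressive_downsample`, `fuse`) are hypothetical and do not come from the s2t repository.

```python
from math import prod
from typing import List, Sequence

Frame = List[float]


def downsample(frames: List[Frame], stride: int) -> List[Frame]:
    """Mean-pool every `stride` consecutive frames into one frame.

    Assumption: mean pooling stands in for a learned down-sampling layer.
    """
    if len(frames) % stride != 0:
        raise ValueError("sequence length must be divisible by the stride")
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        pooled.append([sum(f[d] for f in window) / stride for d in range(dim)])
    return pooled


def progressive_downsample(
    frames: List[Frame], strides: Sequence[int] = (2, 2, 2, 2, 2)
) -> List[List[Frame]]:
    """Return the sequence after every stage, with the input first.

    Assumption: five stride-2 stages, giving an overall 1/32 compression.
    """
    if len(frames) % prod(strides) != 0:
        raise ValueError("sequence length must be divisible by the total stride")
    stages = [frames]
    for stride in strides:
        stages.append(downsample(stages[-1], stride))
    return stages


def fuse(stages: List[List[Frame]]) -> List[Frame]:
    """Hypothetical fusion: pool each down-sampled stage to the final
    length, then average the aligned sequences frame by frame."""
    target_len = len(stages[-1])
    aligned = [downsample(seq, len(seq) // target_len) for seq in stages[1:]]
    dim = len(aligned[0][0])
    return [
        [sum(seq[t][d] for seq in aligned) / len(aligned) for d in range(dim)]
        for t in range(target_len)
    ]


# 320 one-dimensional toy frames, compressed to 320 / 32 = 10 units.
frames = [[float(i)] for i in range(320)]
stages = progressive_downsample(frames)
print([len(s) for s in stages])  # [320, 160, 80, 40, 20, 10]
fused = fuse(stages)
print(len(fused))  # 10
```

Shortening the sequence in several small steps, rather than one stride-32 step, is what lets each stage operate on progressively coarser units. The fusion step here only shows where information from earlier, finer-grained stages could be re-injected into the final representation.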

Tasks

Speech Recognition
