ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

2023-05-24 · NeurIPS 2023 · Code Available

Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang

Code

github.com/nethermanpro/comsl
Official · In paper · PyTorch · ★ 11

Abstract

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. In particular, we propose to incorporate cross-modality learning into transfer learning and to perform both simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech-to-English-text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
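
The abstract describes running cross-modality learning and transfer learning simultaneously as a multi-task objective. A minimal sketch of that general idea, assuming the objective is a weighted sum of per-task losses, might look like the following. The function name `combine_losses`, the task names, the weights, and the loss values are all hypothetical and chosen for illustration; they are not taken from the paper or its repository.

```python
from typing import Dict


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of per-task losses.

    Tasks that have no entry in `weights` default to a weight of 1.0.
    This is a generic multi-task formulation, not the paper's exact objective.
    """
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-batch loss values, for illustration only.
step_losses = {"speech_translation": 2.0, "cross_modal": 0.5}
# Hypothetical weights: the auxiliary cross-modal term is down-weighted.
step_weights = {"speech_translation": 1.0, "cross_modal": 0.5}

total = combine_losses(step_losses, step_weights)
print(total)
```

In an actual training loop the values would be framework tensors rather than floats, and the combined value would be the quantity that is backpropagated, so that a single optimizer step updates the shared model for all tasks at once.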

Tasks

GPU, Language Modeling, Multi-Task Learning, Speech-to-Text, Speech-to-Text Translation, Transfer Learning, Translation
