SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

2023-10-13

Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg

Code

github.com/NVIDIA/NeMo
Official · In paper · PyTorch · ★ 16,967

Abstract

We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
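The way the components named in the abstract might fit together can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the NeMo implementation: the names `matvec`, `LoRALinear`, and `build_llm_input` are hypothetical, the modality adapter is stood in for by a single linear projection, and placing the speech embeddings before the prompt embeddings is an assumed ordering. The LoRA part follows the standard formulation, where a frozen weight W is used alongside a trainable low-rank update scaled by alpha / r.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / r) * B @ A.

    Only A and B would be trained; W stays fixed, as in a frozen LLM.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen, never updated
        # Standard LoRA init: A is small random, B is zero, so the layer
        # initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def build_llm_input(speech_frames, adapter_weight, prompt_embeddings):
    """Project encoder frames into the LLM embedding space and join them
    with the instruction embeddings (speech-first order is an assumption)."""
    adapted = [matvec(adapter_weight, frame) for frame in speech_frames]
    return adapted + prompt_embeddings


# Toy dimensions: encoder output dim 2, LLM embedding dim 3.
speech_frames = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
adapter_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

llm_input = build_llm_input(speech_frames, adapter_weight, prompt_embeddings)

frozen_weight = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
lora_layer = LoRALinear(frozen_weight, rank=1, alpha=2.0)
hidden = [lora_layer.forward(token) for token in llm_input]
```

Because B starts at zero, the LoRA layer initially reproduces the frozen layer's output, so training begins from the behavior of the unmodified text LLM.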

Tasks

Automatic Speech Recognition (ASR), In-Context Learning, Language Modeling, Speech Recognition, Speech-to-Text
