MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems

2025-02-14

Qingliang Meng, Pengju Ren, Tian Li, Changsong Dai, HuiZhi Liang

Abstract

Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates this distribution toward a specific knowledge domain to produce the final transcription. Traditional ASR-specific LMs are typically trained in a unidirectional (left-to-right) manner to align with autoregressive decoding. However, this prevents the model from leveraging right-side context during training, limiting its representational capacity. In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional training through three objectives: ULM, BMLM, and UMLM. This approach enhances the LM's ability to capture richer linguistic patterns from both left and right contexts while preserving compatibility with standard ASR autoregressive decoding methods. As a result, the MTLM model not only improves the ASR system's performance but also supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies, highlighting its effectiveness and flexibility in ASR applications.
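
For readers unfamiliar with the decoding strategies named in the abstract, the following is a minimal sketch of n-best rescoring in general, not the authors' implementation. It assumes each hypothesis comes with an acoustic-model log-probability and that the final score is a weighted sum of that value and an LM log-probability. The names `rescore_nbest`, `toy_lm_logprob`, `TOY_UNIGRAM`, and `lm_weight` are hypothetical, and the toy unigram table merely stands in for a trained LM such as MTLM.

```python
import math
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort hypotheses by am_logprob + lm_weight * lm_logprob, best first.

    nbest holds (text, am_logprob) pairs from a first-pass decoder.
    lm_weight is an assumed interpolation weight, normally tuned on held-out data.
    """
    rescored = [(text, am + lm_weight * lm_logprob(text)) for text, am in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical unigram probabilities; a placeholder for a real language model.
TOY_UNIGRAM = {
    "recognize": 0.05,
    "speech": 0.05,
    "wreck": 0.001,
    "a": 0.1,
    "nice": 0.02,
    "beach": 0.01,
}


def toy_lm_logprob(text: str) -> float:
    """Sum of per-word log-probabilities, with a small floor for unseen words."""
    return sum(math.log(TOY_UNIGRAM.get(word, 1e-6)) for word in text.split())


# Made-up first-pass output: the acoustically preferred hypothesis is listed first.
nbest = [("wreck a nice beach", -4.0), ("recognize speech", -4.5)]
best_text, best_score = rescore_nbest(nbest, toy_lm_logprob)[0]
print(best_text)
```

In this toy setup the LM term outweighs the small acoustic advantage of the first hypothesis, so the ranking flips. Shallow fusion uses the same kind of log-linear combination, but applies it token by token during beam search rather than to complete hypotheses afterwards.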

Tasks

Automatic Speech Recognition (ASR), Language Modelling, Speech Recognition
