Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam

2023-01-17

Kavya Manohar, A. R. Jayan, Rajeev Rajan

Code Available

Abstract

In a hybrid automatic speech recognition (ASR) system, a pronunciation lexicon (PL) and a language model (LM) are essential to correctly retrieve spoken word sequences. Because Malayalam is a morphologically complex language, its vocabulary is so large that it is impossible to build a PL and an LM that cover all of its diverse word forms. Using subword tokens to build the PL and LM, and combining them to form words after decoding, enables the recovery of many out-of-vocabulary words. In this work, we investigate the impact of using syllables as subword tokens instead of words in Malayalam ASR, and evaluate the relative improvement in lexicon size, model memory requirement, and word error rate.
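
The idea of decoding into subword tokens and then combining them back into words can be illustrated with a minimal sketch. It is not the authors' implementation: the `+` continuation marker, the function names `mark_subwords` and `join_subwords`, and the romanized example syllables are all assumptions made for illustration, and the syllabification step itself (which is language-specific) is taken as given.

```python
from typing import List

MARKER = "+"  # hypothetical continuation marker; not taken from the paper


def mark_subwords(syllables: List[str]) -> List[str]:
    """Tag every non-final syllable of one word so word boundaries can be recovered later."""
    if not syllables:
        return []
    return [s + MARKER for s in syllables[:-1]] + [syllables[-1]]


def join_subwords(tokens: List[str]) -> List[str]:
    """Rebuild words from a decoded token sequence by merging marked tokens with what follows."""
    words: List[str] = []
    current = ""
    for token in tokens:
        if token.endswith(MARKER):
            # Word continues: strip the marker and keep accumulating.
            current += token[: -len(MARKER)]
        else:
            # Unmarked token closes the current word.
            words.append(current + token)
            current = ""
    if current:
        # Decoder output ended mid-word; keep the partial word rather than drop it.
        words.append(current)
    return words


if __name__ == "__main__":
    # Romanized, pre-split syllables used purely as placeholder input.
    tokens = mark_subwords(["ma", "la", "yA", "LaM"]) + mark_subwords(["bhA", "Sha"])
    print(tokens)
    print(join_subwords(tokens))
```

Because the lexicon and language model only need entries for the marked syllable tokens, a word never seen in training can still be emitted as long as its syllables are covered, which is the mechanism the abstract credits for recovering out-of-vocabulary words.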
