Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

2023-10-10

Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

Code

github.com/srijith-rkr/whispering-llama
Official · In paper · PyTorch · ★ 269

Abstract

We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a new paradigm in generative error correction based on n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance obtained from pre-trained speech and text models. Evaluating across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique and demonstrate a 37.66% relative improvement in word error rate (WERR) compared with the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
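The abstract reports its headline result as a relative word error rate figure (WERR). As a rough illustration of how such a number is typically computed, the sketch below derives WER from a word-level edit distance and then expresses the gain as a percentage of the baseline WER. It assumes WERR means relative WER reduction, which this page does not define explicitly. The function names and the toy sentences are hypothetical and are not taken from the paper or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER improvement in percent (assumed meaning of WERR)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = "the cat sat on the mat"
    top_hypothesis = "the cat sat on mat"    # stand-in for a 1-best ASR output
    corrected = "the cat sat on the mat"     # stand-in for a corrected output

    baseline = word_error_rate(reference, top_hypothesis)
    improved = word_error_rate(reference, corrected)
    print(f"baseline WER: {baseline:.3f}, corrected WER: {improved:.3f}")
    print(f"relative reduction: {relative_wer_reduction(baseline, improved):.1f}%")
```

Under this reading, a 37.66% WERR would mean the corrected transcripts remove roughly 38% of the baseline's word errors. It would not mean an absolute drop of 37.66 WER points.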

Tasks

Automatic Speech Recognition (ASR), Speech Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LRS2	Whisper-LLaMA	Test WER	6.6	—	Unverified
