Data Augmentation for Spoken Language Understanding via Pretrained Language Models
Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao
- Code: github.com/pengbaolin/soloistpytorch (★ 77)
Abstract
The training of spoken language understanding (SLU) models often faces the problem of data scarcity. In this paper, we put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances. Furthermore, we investigate and propose solutions to two previously overlooked semi-supervised learning scenarios of data scarcity in SLU: i) Rich-in-Ontology: ontology information with numerous valid dialogue acts is given; ii) Rich-in-Utterance: a large number of unlabelled utterances are available. Empirical results show that our method can produce synthetic training data that boosts the performance of language understanding models in various scenarios.
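To make the Rich-in-Ontology scenario concrete, here is a minimal sketch of the data flow it implies: dialogue acts drawn from the ontology are flattened into text, passed to a generator, and each output is paired with its act as a synthetic labelled example. This is an illustration under stated assumptions, not the authors' implementation. The `DialogueAct` shape, the `linearize` format, and the `augment` helper are all hypothetical, and `toy_generator` is only a stand-in for a fine-tuned pretrained language model so that the example runs without external dependencies.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a dialogue act is an intent plus slot-value pairs.
DialogueAct = Tuple[str, Dict[str, str]]


def linearize(act: DialogueAct) -> str:
    """Flatten a dialogue act into a text prompt a language model could condition on."""
    intent, slots = act
    # Sort slots so the same act always yields the same prompt.
    slot_text = " ; ".join(f"{name} = {value}" for name, value in sorted(slots.items()))
    return f"{intent} ( {slot_text} )"


def augment(
    acts: List[DialogueAct], generate: Callable[[str], str]
) -> List[Tuple[str, DialogueAct]]:
    """Pair each dialogue act with an utterance produced by `generate`."""
    return [(generate(linearize(act)), act) for act in acts]


def toy_generator(prompt: str) -> str:
    """Stand-in for a pretrained LM (assumption): it only wraps the prompt in a fixed frame."""
    return f"user request for: {prompt}"


acts = [("book_hotel", {"city": "Paris", "stars": "4"})]
synthetic = augment(acts, toy_generator)
```

In practice, `generate` would wrap a pretrained model fine-tuned on the available labelled pairs, and the resulting `(utterance, act)` pairs would be added to the training set of the language understanding model.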