CST5: Data augmentation for Code-Switched Semantic Parsing
Anonymous
Abstract
Extending semantic parsers to code-switched input has been a challenging problem, primarily due to the lack of labeled data for supervision. In this work, we introduce CST5, a new data augmentation technique that finetunes a T5 model on a small seed set (100 utterances) to generate code-switched utterances from English utterances. We demonstrate the effectiveness of CST5 by comparing baseline models trained without data augmentation to models trained with augmented data, across varying amounts of training data. With CST5, one can achieve the same semantic parsing performance using up to 20x less labeled data. To aid further research, we release over 10k human-annotated Hindi-English (Hinglish) code-switched utterances along with 170k CST5-generated code-switched utterances from the TOPv2 dataset. The generated data is of high quality: native annotators deemed over 98% of it natural and 89% semantically equivalent to the source utterances.