
CST5: Data augmentation for Code-Switched Semantic Parsing

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

Extending semantic parsers to code-switched input has been a challenging problem, primarily due to the lack of labeled data for supervision. In this work, we introduce CST5, a new data augmentation technique that finetunes a T5 model on a small seed set (100 utterances) to generate code-switched utterances from English utterances. We demonstrate the effectiveness of CST5 by comparing baseline models trained without data augmentation to models trained with augmented data, across varying amounts of training data. Using CST5, one can achieve the same semantic parsing performance with up to 20x less labeled data. To aid further research, we release over 10K human-annotated Hindi-English (Hinglish) code-switched utterances along with 170K CST5-generated code-switched utterances from the TOPv2 dataset. The generated data is of high quality: native annotators deemed over 98% of it natural and 89% semantically equivalent to the English source.
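The core recipe described in the abstract (turn a small seed set of English → Hinglish pairs into seq2seq training examples for a T5-style model) can be sketched as below. The task prefix, helper name, and the example utterances are illustrative assumptions, not taken from the paper or its released data.

```python
# Sketch of preparing a seed set for a T5-style fine-tune, as CST5 does
# with ~100 annotated English -> Hinglish code-switched utterances.
# The prefix string and sample pairs are assumptions for illustration.

def make_seq2seq_examples(seed_pairs, prefix="translate English to Hinglish: "):
    """Turn (english, hinglish) pairs into T5-style (input, target) strings."""
    return [(prefix + en, hi) for en, hi in seed_pairs]

# Hypothetical seed pairs standing in for the paper's 100-utterance seed set.
seed = [
    ("set an alarm for 7 am", "7 baje ka alarm set karo"),
    ("play some music", "kuch music play karo"),
]

examples = make_seq2seq_examples(seed)
print(examples[0][0])  # translate English to Hinglish: set an alarm for 7 am
```

In practice these (input, target) strings would be tokenized and fed to a standard seq2seq fine-tuning loop; the fine-tuned model is then run over unlabeled English utterances (e.g. from TOPv2) to generate the augmented code-switched data.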
