Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems

2022-06-01 · EAMT 2022

Alina Kramchaninova, Arne Defauw

Abstract

Deep learning models have significantly advanced the state of the art of question answering systems. However, the majority of datasets available for training such models have been annotated by humans, are open-domain, and are composed primarily in English. To deal with these limitations, we introduce a pipeline that creates synthetic data from natural text. To illustrate the domain-adaptability of our approach, as well as its multilingual potential, we use our pipeline to obtain synthetic data in English and Dutch. We combine the synthetic data with non-synthetic data (SQuAD 2.0) and evaluate multilingual BERT models on the question answering task. Models trained with synthetically augmented data demonstrate a clear improvement in performance when evaluated on the domain-specific test set, compared to the models trained exclusively on SQuAD 2.0. We expect our work to be beneficial for training domain-specific question-answering systems when the amount of available data is limited.

Tasks

Question Answering, Synthetic Data Generation
