emrQA: A Large Corpus for Question Answering on Electronic Medical Records
Anusri Pampari, Preethi Raghavan, Jennifer Liang, Jian Peng
Code
- github.com/panushri25/emrQA (official, in paper; 153 stars)
- github.com/xiangyue9607/CliniRC (TensorFlow; 18 stars)
- github.com/YIKUAN8/Clinical-Longformer (0 stars)
Abstract
We propose a novel methodology for generating large-scale, domain-specific question answering (QA) datasets by re-purposing existing annotations created for other NLP tasks. We demonstrate an instance of this methodology by generating a large-scale QA dataset for electronic medical records, leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
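As a rough illustration of the re-purposing idea described in the abstract, the sketch below turns a toy relation annotation into a question, a logical form, and an answer with its evidence. Everything in it is an assumption made for illustration: the `RelationAnnotation` schema, the two template strings, and the example note text are hypothetical and are not taken from emrQA or the i2b2 datasets.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class RelationAnnotation:
    """Hypothetical expert annotation linking a medication to a problem."""
    medication: str
    problem: str
    evidence: str  # sentence in the note that supports the relation


# Hypothetical templates, chosen only for illustration.
QUESTION_TEMPLATE = "Why is the patient on {medication}?"
LOGICAL_FORM_TEMPLATE = "MedicationEvent({medication}) given ConditionEvent(x)"


def generate_qa(annotations: Iterable[RelationAnnotation]) -> Iterator[Dict[str, str]]:
    """Fill the templates with annotated entities to produce QA examples."""
    for ann in annotations:
        yield {
            "question": QUESTION_TEMPLATE.format(medication=ann.medication),
            "logical_form": LOGICAL_FORM_TEMPLATE.format(medication=ann.medication),
            "answer": ann.problem,       # the annotated related entity
            "evidence": ann.evidence,    # the supporting text from the note
        }


# Made-up annotation, not real patient data.
toy_annotations = [
    RelationAnnotation(
        medication="metformin",
        problem="type 2 diabetes",
        evidence="Started metformin for type 2 diabetes.",
    )
]
examples = list(generate_qa(toy_annotations))
```

Because the answer and its evidence come directly from an existing annotation, no new manual labeling is needed per question; this is the property such a re-purposing approach relies on to scale.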