
UnifiedQA: Crossing Format Boundaries With a Single QA System

2020-05-02 · Findings of the Association for Computational Linguistics · Code available

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi

Abstract

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.
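The core idea of UnifiedQA is that every QA format can be serialized into a single text-in/text-out interface, so one seq2seq model can handle all of them. The sketch below illustrates the kind of encoding this implies; the exact lowercasing and separator details are assumptions for illustration, not a reproduction of the paper's released preprocessing code.

```python
def encode_example(question, choices=None, context=None):
    """Serialize a QA instance into one plain-text input string.

    Extractive, abstractive, multiple-choice, and yes/no questions all
    collapse into the same format: question, optional enumerated answer
    options, optional context passage, joined by a literal "\\n" marker.
    (Separator and lowercasing are illustrative assumptions.)
    """
    parts = [question.lower()]
    if choices:
        # Multiple-choice: enumerate the options as (a), (b), (c), ...
        letters = "abcdefghij"
        opts = " ".join(
            f"({letters[i]}) {c.lower()}" for i, c in enumerate(choices)
        )
        parts.append(opts)
    if context:
        # Extractive/abstractive: append the supporting passage.
        parts.append(context.lower())
    return " \\n ".join(parts)
```

With this kind of encoding, a multiple-choice question and an extractive question become structurally identical model inputs, which is what lets a single model train on, and transfer across, all four formats.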

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CommonsenseQA | BART-large 440M (fine-tuned) | Accuracy | 62.5 | — | Unverified |
| CommonsenseQA | UnifiedQA 11B (fine-tuned) | Accuracy | 79.1 | — | Unverified |
| CommonsenseQA | T5-XXL 11B (fine-tuned) | Accuracy | 78.1 | — | Unverified |
| CommonsenseQA | UnifiedQA 11B (zero-shot) | Accuracy | 76.2 | — | Unverified |
| CommonsenseQA | UnifiedQA 440M (fine-tuned) | Accuracy | 64.0 | — | Unverified |
| WinoGrande | UnifiedQA 11B (fine-tuned) | Accuracy | 89.4 | — | Unverified |
| WinoGrande | UnifiedQA 406M (fine-tuned) | Accuracy | 73.3 | — | Unverified |
