PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
Code Available
- github.com/vered1986/self_talk (PyTorch, ★ 79)
- github.com/AkariAsai/logic_guided_qa (PyTorch, ★ 71)
Abstract
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains, such as news articles and encyclopedia entries, where text is plentiful, text in more physical domains is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical commonsense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering, or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide an analysis of the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.
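PIQA frames physical commonsense as a binary choice: given a goal, a model picks the more sensible of two candidate solutions, so random guessing scores about 50% accuracy (the baseline in the table below). A minimal sketch of this evaluation format; the example items and the `accuracy` helper are illustrative, not drawn from the actual dataset or its official evaluation code:

```python
import random

# Each PIQA-style item pairs a goal with two candidate solutions; the label
# marks which solution (0 or 1) is the physically sensible one.
# These toy items are illustrative, not taken from the real dataset.
items = [
    {"goal": "Apply eyeshadow without a brush.",
     "solutions": ["Use a cotton swab.", "Use a toothpick."],
     "label": 0},
    {"goal": "Keep ice cream frozen on a picnic.",
     "solutions": ["Leave it in direct sunlight.",
                   "Store it in a cooler with ice packs."],
     "label": 1},
]

def accuracy(predictions, items):
    """Fraction of items whose predicted solution index matches the label."""
    correct = sum(p == item["label"] for p, item in zip(predictions, items))
    return correct / len(items)

# Random-chance baseline: pick solution 0 or 1 uniformly at random.
# Over many items this converges to ~0.5, matching the 50% row below.
rng = random.Random(0)
random_preds = [rng.randint(0, 1) for _ in items]
print(accuracy(random_preds, items))
```

A real scorer would load the dataset's goal/solution pairs and replace the random picker with a model's choices; the accuracy computation itself stays the same.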
Tasks
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| PIQA | RoBERTa-large 355M (fine-tuned) | Accuracy | 77.1 | — | Unverified |
| PIQA | GPT-2-small 124M (fine-tuned) | Accuracy | 69.2 | — | Unverified |
| PIQA | BERT-large 340M (fine-tuned) | Accuracy | 66.8 | — | Unverified |
| PIQA | Random chance baseline | Accuracy | 50.0 | — | Unverified |