How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG
Paul Trichelair, Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung
Code
- Official: github.com/ptrichel/How-Reasonable-are-Common-Sense-Reasoning-Tasks
Abstract
Recent studies have significantly improved the state of the art on common-sense reasoning (CSR) benchmarks such as the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We present case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks, including size limitations, structural regularities, and variable instance difficulty.
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| Winograd Schema Challenge | GPT-2 Large 774M (partial scoring) | Accuracy | 69.2 | — | Unverified |
| Winograd Schema Challenge | GPT-2 Large 774M (full scoring) | Accuracy | 64.5 | — | Unverified |
| Winograd Schema Challenge | GPT-2 Small 117M (partial scoring) | Accuracy | 61.5 | — | Unverified |
| Winograd Schema Challenge | GPT-2 Small 117M (full scoring) | Accuracy | 55.7 | — | Unverified |
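The "partial scoring" and "full scoring" entries refer to two ways of ranking a Winograd candidate pair with a language model. The sketch below is a rough illustration of these protocols, not the authors' released code (see the repository above). It assumes the substitution-based setup popularized by Trinh and Le: replace the ambiguous pronoun with each candidate and compare model log-probabilities, either over the whole substituted sentence (full) or only over the tokens following the candidate (partial). The model checkpoint and example schema are placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # swap in "gpt2-large", etc.
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_log_probs(text):
    """Log-probability the model assigns to each token of `text` (after the first)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)

def score(prefix, candidate, suffix, partial):
    """Full scoring: sum log-probs over the whole substituted sentence.
    Partial scoring: sum only over the tokens after the candidate."""
    lp = token_log_probs(prefix + candidate + suffix)
    if not partial:
        return lp.sum().item()
    # Number of tokens in prefix + candidate; assumes tokenization is
    # consistent at the candidate/suffix boundary (true for this example).
    n_ctx = len(tokenizer(prefix + candidate).input_ids)
    return lp[n_ctx - 1:].sum().item()  # lp[i] scores token i+1

prefix = "The trophy doesn't fit in the suitcase because "
suffix = " is too big."
for partial in (False, True):
    scores = {c: score(prefix, "the " + c, suffix, partial)
              for c in ("trophy", "suitcase")}
    print("partial" if partial else "full", "->", max(scores, key=scores.get))
```

Partial scoring conditions on the substituted candidate rather than scoring it, so a low-frequency candidate string is not penalized for its own rarity; this is consistent with partial scoring outperforming full scoring in the table above.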