Are NLP Models really able to Solve Simple Math Word Problems?

2021-03-12 · NAACL 2021 · Code Available

Arkil Patel, Satwik Bhattamishra, Navin Goyal

Code

github.com/arkilpatel/SVAMP (official, in paper, PyTorch, ★ 139)
github.com/debjitpaul/refiner (PyTorch, ★ 74)
github.com/vedantgaur/symbolic-mwp-reasoning (no framework listed, ★ 2)

Abstract

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.
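The two probes described in the abstract (hiding the question from the solver, and treating the problem as a bag of words) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_body_and_question` and `bag_of_words` are hypothetical, the example problem is invented for illustration, and the rule "the last sentence is the question" is a simplifying assumption.

```python
import re
from collections import Counter
from typing import Tuple


def split_body_and_question(problem: str) -> Tuple[str, str]:
    """Split a word problem into its narrative body and its question.

    Assumption: the final sentence is the question. Real datasets may
    need a more careful rule.
    """
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1]), sentences[-1]


def bag_of_words(text: str) -> Counter:
    """Count lowercase word and number tokens, discarding word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Invented example problem (not taken from any dataset).
problem = "Jack had 8 pens. Mary gave him 5 pens. How many pens does Jack have now?"
body, question = split_body_and_question(problem)

# A question-removed solver would only ever see `body`.
print(body)
print(question)

# Word order is lost in a bag of words, so these two statements,
# which describe opposite transfers, become indistinguishable.
print(bag_of_words("Jack gave Mary 5 pens") == bag_of_words("Mary gave Jack 5 pens"))
```

If a solver fed only `body`, or only the token counts, still reaches high accuracy on a benchmark, that suggests the benchmark can be solved without reading what is actually asked.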

Tasks

Math, Math Word Problem Solving

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ASDiv-A	GTS with RoBERTa	Execution Accuracy	81.2	—	Unverified
ASDiv-A	LSTM Seq2Seq with RoBERTa	Execution Accuracy	76.9	—	Unverified
ASDiv-A	Graph2Tree with RoBERTa	Execution Accuracy	82.2	—	Unverified
MAWPS	Graph2Tree with RoBERTa	Accuracy (%)	88.7	—	Unverified
MAWPS	GTS with RoBERTa	Accuracy (%)	88.5	—	Unverified
SVAMP	Graph2Tree with RoBERTa	Execution Accuracy	43.8	—	Unverified
SVAMP	GTS with RoBERTa	Execution Accuracy	41	—	Unverified
SVAMP	LSTM Seq2Seq with RoBERTa	Execution Accuracy	40.3	—	Unverified
SVAMP	Transformer with RoBERTa	Execution Accuracy	38.9	—	Unverified
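For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute execution accuracy: evaluate each predicted arithmetic expression and compare the result with the gold answer. It assumes that predictions are plain infix expressions over `+`, `-`, `*` and `/`; the names `evaluate` and `execution_accuracy` and the tolerance value are hypothetical, and the official SVAMP code may differ in its details.

```python
import ast
import operator

# Arithmetic operators permitted in a predicted expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expr, mode="eval"))


def execution_accuracy(predictions, gold_answers, tol=1e-4) -> float:
    """Percentage of predictions whose evaluated value matches the gold answer."""
    if not gold_answers:
        raise ValueError("gold_answers must not be empty")
    correct = 0
    for expr, gold in zip(predictions, gold_answers):
        try:
            if abs(evaluate(expr) - gold) <= tol:
                correct += 1
        except (ValueError, SyntaxError, ZeroDivisionError):
            # Malformed or unevaluable predictions count as wrong.
            pass
    return 100.0 * correct / len(gold_answers)


# Invented predictions for a gold answer of 13 in every case.
print(execution_accuracy(["8 + 5", "5 + 8", "8 - 5", "8 /"], [13, 13, 13, 13]))
```

Because only the evaluated value is compared, `8 + 5` and `5 + 8` both count as correct, while a malformed expression such as `8 /` is scored as an error.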
