Math Word Problem Solving

A math word problem is a mathematical exercise (such as in a textbook, worksheet, or exam) in which significant background information on the problem is presented in ordinary language rather than in mathematical notation. Because most word problems involve a narrative of some sort, they are sometimes called story problems. They vary in the amount of technical language used.

Papers

Showing 31–40 of 107 papers

Title	Date	Tasks	Status	Hype
Solving Quantitative Reasoning Problems with Language Models	Jun 29, 2022	Arithmetic Reasoning, Language Modeling	Code Available	2
Large Language Models are Zero-Shot Reasoners	May 24, 2022	Arithmetic Reasoning, Common Sense Reasoning	Code Available	2
Measuring Mathematical Problem Solving With the MATH Dataset	Mar 5, 2021	Math, Mathematical Problem-Solving	Code Available	2
DeBERTa: Decoding-enhanced BERT with Disentangled Attention	Jun 5, 2020	Common Sense Reasoning, Coreference Resolution	Code Available	2
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems	Apr 23, 2024	Arithmetic Reasoning, GSM8K	Code Available	1
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing	Apr 18, 2024	Arithmetic Reasoning, GSM8K	Code Available	1
Augmenting Math Word Problems via Iterative Question Composing	Jan 17, 2024	Math, Mathematical Reasoning	Code Available	1
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations	Dec 14, 2023	Arithmetic Reasoning, GSM8K	Code Available	1
FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains	Nov 16, 2023	Math, Math Word Problem Solving	Code Available	1
Do Multilingual Language Models Think Better in English?	Aug 2, 2023	Common Sense Reasoning, Cross-Lingual Natural Language Inference	Code Available	1

Datasets: MATH, SVAMP, MAWPS, Math23K, ALG514, ASDiv-A, ParaMAWPS, DRAW-1K, MathQA, SVAMP (1:N), GSM-Plus, MATH minival

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Gemini 2.0 Flash Experimental	Accuracy	89.7	—	Unverified
2	Qwen2.5-Math-72B-Instruct(TIR,Greedy)	Accuracy	88.1	—	Unverified
3	GPT-4 Turbo (MACM, w/code, voting)	Accuracy	87.92	—	Unverified
4	Qwen2.5-Math-72B-Instruct(COT,Greedy)	Accuracy	85.9	—	Unverified
5	Qwen2.5-Math-7B-Instruct(TIR,Greedy)	Accuracy	85.2	—	Unverified
6	GPT-4-code model (CSV, w/ code, SC, k=16)	Accuracy	84.3	—	Unverified
7	Qwen2-Math-72B-Instruct(greedy)	Accuracy	84	—	Unverified
8	Qwen2.5-Math-7B-Instruct(COT,Greedy)	Accuracy	83.6	—	Unverified
9	Qwen2.5-Math-1.5B-Instruct(TIR,Greedy)	Accuracy	79.9	—	Unverified
10	OpenMath2-Llama3.1-70B (majority@256)	Accuracy	79.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 DUP	Accuracy	94.2	—	Unverified
2	GPT-4 (Teaching-Inspired)	Execution Accuracy	93.9	—	Unverified
3	GPT-4 (Model Selection)	Execution Accuracy	93.7	—	Unverified
4	Qwen2(CoT + Code Interpreter)	Execution Accuracy	92.3	—	Unverified
5	GPT-4 (PHP)	Execution Accuracy	91.9	—	Unverified
6	OpenMath-CodeLlama-70B (w/ code)	Execution Accuracy	87.8	—	Unverified
7	MathCoder-L-70B	Execution Accuracy	84.9	—	Unverified
8	PoT_Eng (self-consistency @ 5)	Execution Accuracy	83.7	—	Unverified
9	CoT_Eng (self-consistency @ 5)	Execution Accuracy	82.5	—	Unverified
10	MMOS-CODE-34B(0-shot)	Execution Accuracy	80.6	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	OpenMath-CodeLlama-70B (w/ code)	Accuracy (%)	95.7	—	Unverified
2	MsAT-DeductReasoner	Accuracy (%)	94.3	—	Unverified
3	ATHENA (roberta-large)	Accuracy (%)	93	—	Unverified
4	Exp-Tree	Accuracy (%)	92.3	—	Unverified
5	Multi-view	Accuracy (%)	92.3	—	Unverified
6	ATHENA (roberta-base)	Accuracy (%)	92.2	—	Unverified
7	Roberta-DeductReasoner	Accuracy (%)	92	—	Unverified
8	DeBERTa (PM + VM)	Accuracy (%)	91	—	Unverified
9	EPT	Accuracy (%)	88.7	—	Unverified
10	Graph2Tree with RoBERTa	Accuracy (%)	88.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 (Teaching-Inspired)	Accuracy (5-fold)	94.3	—	Unverified
2	ATHENA (roberta-large)	Accuracy (training-test)	86.5	—	Unverified
3	Multi-view* (ours)	Accuracy (5-fold)	85.2	—	Unverified
4	ATHENA (roberta-base)	Accuracy (training-test)	84.4	—	Unverified
5	Generate and Rank	Accuracy (5-fold)	84.3	—	Unverified
6	Exp-Tree	Accuracy (5-fold)	84.1	—	Unverified
7	REAL2: Memory-augmented Solver	Accuracy (5-fold)	83.18	—	Unverified
8	Roberta-DeductReasoner	Accuracy (5-fold)	83	—	Unverified
9	MWP-BERT	Accuracy (5-fold)	82.4	—	Unverified
10	Recall and Learn	Accuracy (5-fold)	80.8	—	Unverified