Math Word Problem Solving
A math word problem is a mathematical exercise (such as in a textbook, worksheet, or exam) in which significant background information is presented in ordinary language rather than in mathematical notation. For example, "Jane had 5 apples and gave 2 to Tom; how many does she have left?" encodes the expression 5 − 2 = 3. Because most word problems involve a narrative of some sort, they are sometimes called story problems, and they vary in the amount of technical language they use.
Papers
107 papers address this task.
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Flash Experimental | Accuracy | 89.7 | — | Unverified |
| 2 | Qwen2.5-Math-72B-Instruct (TIR, Greedy) | Accuracy | 88.1 | — | Unverified |
| 3 | GPT-4 Turbo (MACM, w/code, voting) | Accuracy | 87.92 | — | Unverified |
| 4 | Qwen2.5-Math-72B-Instruct (CoT, Greedy) | Accuracy | 85.9 | — | Unverified |
| 5 | Qwen2.5-Math-7B-Instruct (TIR, Greedy) | Accuracy | 85.2 | — | Unverified |
| 6 | GPT-4-code model (CSV, w/ code, SC, k=16) | Accuracy | 84.3 | — | Unverified |
| 7 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 84.0 | — | Unverified |
| 8 | Qwen2.5-Math-7B-Instruct (CoT, Greedy) | Accuracy | 83.6 | — | Unverified |
| 9 | Qwen2.5-Math-1.5B-Instruct (TIR, Greedy) | Accuracy | 79.9 | — | Unverified |
| 10 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 79.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 DUP | Accuracy | 94.2 | — | Unverified |
| 2 | GPT-4 (Teaching-Inspired) | Execution Accuracy | 93.9 | — | Unverified |
| 3 | GPT-4 (Model Selection) | Execution Accuracy | 93.7 | — | Unverified |
| 4 | Qwen2 (CoT + Code Interpreter) | Execution Accuracy | 92.3 | — | Unverified |
| 5 | GPT-4 (PHP) | Execution Accuracy | 91.9 | — | Unverified |
| 6 | OpenMath-CodeLlama-70B (w/ code) | Execution Accuracy | 87.8 | — | Unverified |
| 7 | MathCoder-L-70B | Execution Accuracy | 84.9 | — | Unverified |
| 8 | PoT_Eng (self-consistency @ 5) | Execution Accuracy | 83.7 | — | Unverified |
| 9 | CoT_Eng (self-consistency @ 5) | Execution Accuracy | 82.5 | — | Unverified |
| 10 | MMOS-CODE-34B (0-shot) | Execution Accuracy | 80.6 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OpenMath-CodeLlama-70B (w/ code) | Accuracy (%) | 95.7 | — | Unverified |
| 2 | MsAT-DeductReasoner | Accuracy (%) | 94.3 | — | Unverified |
| 3 | ATHENA (roberta-large) | Accuracy (%) | 93.0 | — | Unverified |
| 4 | Exp-Tree | Accuracy (%) | 92.3 | — | Unverified |
| 5 | Multi-view | Accuracy (%) | 92.3 | — | Unverified |
| 6 | ATHENA (roberta-base) | Accuracy (%) | 92.2 | — | Unverified |
| 7 | Roberta-DeductReasoner | Accuracy (%) | 92.0 | — | Unverified |
| 8 | DeBERTa (PM + VM) | Accuracy (%) | 91.0 | — | Unverified |
| 9 | EPT | Accuracy (%) | 88.7 | — | Unverified |
| 10 | Graph2Tree with RoBERTa | Accuracy (%) | 88.7 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy (5-fold) | 94.3 | — | Unverified |
| 2 | ATHENA (roberta-large) | Accuracy (training-test) | 86.5 | — | Unverified |
| 3 | Multi-view* (ours) | Accuracy (5-fold) | 85.2 | — | Unverified |
| 4 | ATHENA (roberta-base) | Accuracy (training-test) | 84.4 | — | Unverified |
| 5 | Generate and Rank | Accuracy (5-fold) | 84.3 | — | Unverified |
| 6 | Exp-Tree | Accuracy (5-fold) | 84.1 | — | Unverified |
| 7 | REAL2: Memory-augmented Solver | Accuracy (5-fold) | 83.18 | — | Unverified |
| 8 | Roberta-DeductReasoner | Accuracy (5-fold) | 83.0 | — | Unverified |
| 9 | MWP-BERT | Accuracy (5-fold) | 82.4 | — | Unverified |
| 10 | Recall and Learn | Accuracy (5-fold) | 80.8 | — | Unverified |
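Several entries above name their decoding strategy: greedy decoding, self-consistency (e.g. "SC, k=16"), or "majority@256". A minimal sketch of how majority-vote (self-consistency) accuracy is typically scored, using made-up sampled answers rather than data from any listed system:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among k sampled
    solutions (self-consistency / majority@k decoding)."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(predictions, references):
    """Exact-match accuracy: fraction of problems whose selected
    final answer matches the reference answer."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical sampled answers for three problems (k = 5 each)
samples = [
    ["18", "18", "17", "18", "18"],
    ["42", "40", "42", "42", "41"],
    ["7", "9", "9", "7", "7"],
]
refs = ["18", "42", "9"]

preds = [majority_vote(s) for s in samples]
print(accuracy(preds, refs))  # 2 of the 3 majority answers match
```

Sampling k solutions per problem and grading only the most frequent final answer generally scores higher than grading a single greedy sample, which is why leaderboards report the decoding strategy alongside the model.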