SOTAVerified

Math Word Problem Solving

A math word problem is a mathematical exercise (such as in a textbook, worksheet, or exam) where significant background information on the problem is presented in ordinary language rather than in mathematical notation. As most word problems involve a narrative of some sort, they are sometimes referred to as story problems and may vary in the amount of technical language used.
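
As an illustration of the definition above, the following minimal sketch pairs a made-up story problem with the arithmetic it encodes. The problem text, the `WordProblem` class, its field names, and the `evaluate` helper are all hypothetical and are not taken from any dataset or paper listed on this page.

```python
import ast
import operator
from dataclasses import dataclass


@dataclass
class WordProblem:
    text: str        # the narrative, in ordinary language
    expression: str  # the same problem in mathematical notation
    answer: float    # the expected final answer


# Hypothetical example problem, written for illustration only.
example = WordProblem(
    text=("Ana has 3 boxes with 4 pencils each. "
          "She gives away 5 pencils. How many pencils are left?"),
    expression="3 * 4 - 5",
    answer=7,
)

# Only the four basic arithmetic operators are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


if __name__ == "__main__":
    print(example.text)
    print(example.expression, "=", evaluate(example.expression))
```

The solver's task is the step this sketch takes for granted: recovering `expression` (or the answer directly) from `text` alone.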

Papers

Showing 1-25 of 107 papers

| Title | Status | Hype |
|---|---|---|
| A Diversity-Enhanced Knowledge Distillation Model for Practical Math Word Problem Solving | Code | 0 |
| Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval | | 0 |
| SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation | Code | 0 |
| When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems | | 0 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Code | 0 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Code | 4 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 0 |
| Qwen2 Technical Report | Code | 13 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Code | 2 |
| Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | | 0 |
| AlphaMath Almost Zero: Process Supervision without Process | Code | 3 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Code | 1 |
| Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | Code | 1 |
| MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Code | 2 |
| Data Augmentation with In-Context Learning and Comparative Evaluation in Math Word Problem Solving | | 0 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | Code | 0 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3 |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 0 |
| GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | Code | 2 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | Code | 4 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | Code | 9 |
| Augmenting Math Word Problems via Iterative Question Composing | Code | 1 |
| Mixtral of Experts | Code | 4 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Flash Experimental | Accuracy | 89.7 | | Unverified |
| 2 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) | Accuracy | 88.1 | | Unverified |
| 3 | GPT-4 Turbo (MACM, w/code, voting) | Accuracy | 87.92 | | Unverified |
| 4 | Qwen2.5-Math-72B-Instruct(COT,Greedy) | Accuracy | 85.9 | | Unverified |
| 5 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) | Accuracy | 85.2 | | Unverified |
| 6 | GPT-4-code model (CSV, w/ code, SC, k=16) | Accuracy | 84.3 | | Unverified |
| 7 | Qwen2-Math-72B-Instruct(greedy) | Accuracy | 84 | | Unverified |
| 8 | Qwen2.5-Math-7B-Instruct(COT,Greedy) | Accuracy | 83.6 | | Unverified |
| 9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | Accuracy | 79.9 | | Unverified |
| 10 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 79.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 DUP | Accuracy | 94.2 | | Unverified |
| 2 | GPT-4 (Teaching-Inspired) | Execution Accuracy | 93.9 | | Unverified |
| 3 | GPT-4 (Model Selection) | Execution Accuracy | 93.7 | | Unverified |
| 4 | Qwen2(CoT + Code Interpreter) | Execution Accuracy | 92.3 | | Unverified |
| 5 | GPT-4 (PHP) | Execution Accuracy | 91.9 | | Unverified |
| 6 | OpenMath-CodeLlama-70B (w/ code) | Execution Accuracy | 87.8 | | Unverified |
| 7 | MathCoder-L-70B | Execution Accuracy | 84.9 | | Unverified |
| 8 | PoT_Eng (self-consistency @ 5) | Execution Accuracy | 83.7 | | Unverified |
| 9 | CoT_Eng (self-consistency @ 5) | Execution Accuracy | 82.5 | | Unverified |
| 10 | MMOS-CODE-34B(0-shot) | Execution Accuracy | 80.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OpenMath-CodeLlama-70B (w/ code) | Accuracy (%) | 95.7 | | Unverified |
| 2 | MsAT-DeductReasoner | Accuracy (%) | 94.3 | | Unverified |
| 3 | ATHENA (roberta-large) | Accuracy (%) | 93 | | Unverified |
| 4 | Exp-Tree | Accuracy (%) | 92.3 | | Unverified |
| 5 | Multi-view | Accuracy (%) | 92.3 | | Unverified |
| 6 | ATHENA (roberta-base) | Accuracy (%) | 92.2 | | Unverified |
| 7 | Roberta-DeductReasoner | Accuracy (%) | 92 | | Unverified |
| 8 | DeBERTa (PM + VM) | Accuracy (%) | 91 | | Unverified |
| 9 | EPT | Accuracy (%) | 88.7 | | Unverified |
| 10 | Graph2Tree with RoBERTa | Accuracy (%) | 88.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy (5-fold) | 94.3 | | Unverified |
| 2 | ATHENA (roberta-large) | Accuracy (training-test) | 86.5 | | Unverified |
| 3 | Multi-view* (ours) | Accuracy (5-fold) | 85.2 | | Unverified |
| 4 | ATHENA (roberta-base) | Accuracy (training-test) | 84.4 | | Unverified |
| 5 | Generate and Rank | Accuracy (5-fold) | 84.3 | | Unverified |
| 6 | Exp-Tree | Accuracy (5-fold) | 84.1 | | Unverified |
| 7 | REAL2: Memory-augmented Solver | Accuracy (5-fold) | 83.18 | | Unverified |
| 8 | Roberta-DeductReasoner | Accuracy (5-fold) | 83 | | Unverified |
| 9 | MWP-BERT | Accuracy (5-fold) | 82.4 | | Unverified |
| 10 | Recall and Learn | Accuracy (5-fold) | 80.8 | | Unverified |
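
The tables above report accuracy figures, and several entries mention voting over multiple samples (for example `majority@256` or `self-consistency @ 5`). The sketch below shows one way such a score could be computed. It assumes that accuracy is the percentage of problems whose final numeric answer matches the reference within a small tolerance, and that voting keeps the most frequent sampled answer. The function names, the tolerance, and the toy data are hypothetical; the listed papers may score answers differently.

```python
import math
from collections import Counter


def majority_vote(samples):
    """Return the most frequent answer among several sampled answers."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][0]


def accuracy(predictions, references, tol=1e-6):
    """Percentage of predictions within `tol` of the reference answer."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(
        math.isclose(p, r, abs_tol=tol) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Toy data: three problems, three sampled answers each (made up).
    sampled = [[7, 7, 8], [12, 10, 10], [3, 4, 5]]
    references = [7, 10, 4]
    predictions = [majority_vote(s) for s in sampled]
    print(f"accuracy: {accuracy(predictions, references):.1f}%")
```

Under these assumptions, voting changes only how each prediction is chosen; the accuracy computation itself stays the same.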