SOTAVerified

Math Word Problem Solving

A math word problem is a mathematical exercise (such as in a textbook, worksheet, or exam) where significant background information on the problem is presented in ordinary language rather than in mathematical notation. As most word problems involve a narrative of some sort, they are sometimes referred to as story problems and may vary in the amount of technical language used.
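As a small illustration of the definition above, the sketch below pairs a made-up story problem with the arithmetic it encodes. The problem text, the parameter names, and the `solve` helper are all hypothetical and are not taken from any dataset or paper listed on this page.

```python
# Hypothetical example: a story problem and the notation hidden inside it.
problem = (
    "Maya has 3 boxes of pencils. Each box holds 12 pencils. "
    "She gives 5 pencils to a friend. How many pencils does she have left?"
)

def solve(boxes: int, per_box: int, given_away: int) -> int:
    """Evaluate the expression boxes * per_box - given_away."""
    return boxes * per_box - given_away

# The quantities are read off the narrative by hand for this illustration.
answer = solve(boxes=3, per_box=12, given_away=5)
print(answer)  # 3 * 12 - 5 = 31
```

A solver for this task has to recover both the quantities and the expression from the text; here that step is done manually.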

Papers

Showing 1-50 of 107 papers

| Title | Status | Hype |
|---|---|---|
| Qwen2 Technical Report | Code | 13 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | Code | 9 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Code | 8 |
| LLaMA: Open and Efficient Foundation Language Models | Code | 7 |
| Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Code | 6 |
| Mistral 7B | Code | 6 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | Code | 5 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Code | 4 |
| Mixtral of Experts | Code | 4 |
| Let's Verify Step by Step | Code | 4 |
| Galactica: A Large Language Model for Science | Code | 4 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | Code | 4 |
| PAL: Program-aided Language Models | Code | 3 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | Code | 3 |
| AlphaMath Almost Zero: Process Supervision without Process | Code | 3 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Code | 3 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Code | 2 |
| Progressive-Hint Prompting Improves Reasoning in Large Language Models | Code | 2 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | Code | 2 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Code | 2 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2 |
| Multi-View Reasoning: Consistent Contrastive Learning for Math Word Problem | Code | 2 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | Code | 2 |
| DeBERTa: Decoding-enhanced BERT with Disentangled Attention | Code | 2 |
| An Expression Tree Decoding Strategy for Mathematical Equation Generation | Code | 2 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | Code | 2 |
| Cumulative Reasoning with Large Language Models | Code | 2 |
| GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | Code | 2 |
| MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Code | 2 |
| Measuring Mathematical Problem Solving With the MATH Dataset | Code | 2 |
| Solving Quantitative Reasoning Problems with Language Models | Code | 2 |
| Large Language Models are Zero-Shot Reasoners | Code | 2 |
| ELASTIC: Numerical Reasoning with Adaptive Symbolic Compiler | Code | 1 |
| Do Multilingual Language Models Think Better in English? | Code | 1 |
| Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems | Code | 1 |
| Recall and Learn: A Memory-augmented Solver for Math Word Problems | Code | 1 |
| MathChat: Converse to Tackle Challenging Math Problems with LLM Agents | Code | 1 |
| RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning | Code | 1 |
| FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains | Code | 1 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Code | 1 |
| Automatic Model Selection with Large Language Models for Reasoning | Code | 1 |
| MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving | Code | 1 |
| Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem | Code | 1 |
| Graph-to-Tree Learning for Solving Math Word Problems | Code | 1 |
| Automatic Generation of Socratic Subquestions for Teaching Math Word Problems | Code | 1 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Code | 1 |
| Augmenting Math Word Problems via Iterative Question Composing | Code | 1 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Flash Experimental | Accuracy | 89.7 | | Unverified |
| 2 | Qwen2.5-Math-72B-Instruct(TIR,Greedy) | Accuracy | 88.1 | | Unverified |
| 3 | GPT-4 Turbo (MACM, w/code, voting) | Accuracy | 87.92 | | Unverified |
| 4 | Qwen2.5-Math-72B-Instruct(COT,Greedy) | Accuracy | 85.9 | | Unverified |
| 5 | Qwen2.5-Math-7B-Instruct(TIR,Greedy) | Accuracy | 85.2 | | Unverified |
| 6 | GPT-4-code model (CSV, w/ code, SC, k=16) | Accuracy | 84.3 | | Unverified |
| 7 | Qwen2-Math-72B-Instruct(greedy) | Accuracy | 84 | | Unverified |
| 8 | Qwen2.5-Math-7B-Instruct(COT,Greedy) | Accuracy | 83.6 | | Unverified |
| 9 | Qwen2.5-Math-1.5B-Instruct(TIR,Greedy) | Accuracy | 79.9 | | Unverified |
| 10 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 79.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 DUP | Accuracy | 94.2 | | Unverified |
| 2 | GPT-4 (Teaching-Inspired) | Execution Accuracy | 93.9 | | Unverified |
| 3 | GPT-4 (Model Selection) | Execution Accuracy | 93.7 | | Unverified |
| 4 | Qwen2(CoT + Code Interpreter) | Execution Accuracy | 92.3 | | Unverified |
| 5 | GPT-4 (PHP) | Execution Accuracy | 91.9 | | Unverified |
| 6 | OpenMath-CodeLlama-70B (w/ code) | Execution Accuracy | 87.8 | | Unverified |
| 7 | MathCoder-L-70B | Execution Accuracy | 84.9 | | Unverified |
| 8 | PoT_Eng (self-consistency @ 5) | Execution Accuracy | 83.7 | | Unverified |
| 9 | CoT_Eng (self-consistency @ 5) | Execution Accuracy | 82.5 | | Unverified |
| 10 | MMOS-CODE-34B(0-shot) | Execution Accuracy | 80.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | OpenMath-CodeLlama-70B (w/ code) | Accuracy (%) | 95.7 | | Unverified |
| 2 | MsAT-DeductReasoner | Accuracy (%) | 94.3 | | Unverified |
| 3 | ATHENA (roberta-large) | Accuracy (%) | 93 | | Unverified |
| 4 | Exp-Tree | Accuracy (%) | 92.3 | | Unverified |
| 5 | Multi-view | Accuracy (%) | 92.3 | | Unverified |
| 6 | ATHENA (roberta-base) | Accuracy (%) | 92.2 | | Unverified |
| 7 | Roberta-DeductReasoner | Accuracy (%) | 92 | | Unverified |
| 8 | DeBERTa (PM + VM) | Accuracy (%) | 91 | | Unverified |
| 9 | EPT | Accuracy (%) | 88.7 | | Unverified |
| 10 | Graph2Tree with RoBERTa | Accuracy (%) | 88.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy (5-fold) | 94.3 | | Unverified |
| 2 | ATHENA (roberta-large) | Accuracy (training-test) | 86.5 | | Unverified |
| 3 | Multi-view* (ours) | Accuracy (5-fold) | 85.2 | | Unverified |
| 4 | ATHENA (roberta-base) | Accuracy (training-test) | 84.4 | | Unverified |
| 5 | Generate and Rank | Accuracy (5-fold) | 84.3 | | Unverified |
| 6 | Exp-Tree | Accuracy (5-fold) | 84.1 | | Unverified |
| 7 | REAL2: Memory-augmented Solver | Accuracy (5-fold) | 83.18 | | Unverified |
| 8 | Roberta-DeductReasoner | Accuracy (5-fold) | 83 | | Unverified |
| 9 | MWP-BERT | Accuracy (5-fold) | 82.4 | | Unverified |
| 10 | Recall and Learn | Accuracy (5-fold) | 80.8 | | Unverified |
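
Several entries above carry labels such as "majority@256", "self-consistency @ 5", or "voting". These generally indicate that multiple answers are sampled per problem and the most frequent one is scored. The sketch below is a minimal illustration of that idea under assumed inputs. The function names and the toy data are hypothetical, and the listed papers may normalize answers or break ties differently.

```python
from collections import Counter

def majority_answer(samples):
    """Return the most frequent answer among the sampled answers."""
    return Counter(samples).most_common(1)[0][0]

def majority_vote_accuracy(sampled, gold):
    """Percentage of problems whose majority answer equals the gold answer."""
    correct = sum(majority_answer(s) == g for s, g in zip(sampled, gold))
    return 100.0 * correct / len(gold)

# Toy data (made up): three problems, three sampled answers each.
sampled = [
    ["31", "31", "29"],  # majority "31", matches gold
    ["7", "8", "8"],     # majority "8", matches gold
    ["12", "12", "10"],  # majority "12", does not match gold
]
gold = ["31", "8", "10"]

accuracy = majority_vote_accuracy(sampled, gold)
print(f"{accuracy:.1f}")
```

Answers are compared as strings here for simplicity; a real evaluation would likely canonicalize numeric formats before voting.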