SOTAVerified

Arithmetic Reasoning

Papers

Showing 1–50 of 175 papers

| Title | Status | Hype |
|---|---|---|
| Qwen2 Technical Report | Code | 13 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | Code | 9 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Code | 8 |
| LLaMA: Open and Efficient Foundation Language Models | Code | 7 |
| Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Code | 6 |
| Mistral 7B | Code | 6 |
| GPT-4 Technical Report | Code | 6 |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | Code | 5 |
| WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | Code | 5 |
| ReFT: Representation Finetuning for Language Models | Code | 5 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Code | 4 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | Code | 4 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | Code | 3 |
| WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks | Code | 3 |
| LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models | Code | 3 |
| Llemma: An Open Language Model For Mathematics | Code | 3 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3 |
| Reasoning with Language Model Prompting: A Survey | Code | 3 |
| PAL: Program-aided Language Models | Code | 3 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Code | 2 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | Code | 2 |
| Large Language Models are Zero-Shot Reasoners | Code | 2 |
| Is ChatGPT a General-Purpose Natural Language Processing Task Solver? | Code | 2 |
| CAPO: Cost-Aware Prompt Optimization | Code | 2 |
| Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate | Code | 2 |
| An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning | Code | 2 |
| Solving Quantitative Reasoning Problems with Language Models | Code | 2 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Code | 2 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2 |
| Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | Code | 2 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | Code | 2 |
| Progressive-Hint Prompting Improves Reasoning in Large Language Models | Code | 2 |
| Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification | Code | 2 |
| Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs | Code | 2 |
| Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Code | 2 |
| Boosting Language Models Reasoning with Chain-of-Knowledge Prompting | Code | 1 |
| Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | Code | 1 |
| An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs | Code | 1 |
| Batch Prompting: Efficient Inference with Large Language Model APIs | Code | 1 |
| Language Imbalance Driven Rewarding for Multilingual Self-improving | Code | 1 |
| Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data | Code | 1 |
| Large Language Models are Better Reasoners with Self-Verification | Code | 1 |
| Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning | Code | 1 |
| Automatic Model Selection with Large Language Models for Reasoning | Code | 1 |
| Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure | Code | 1 |
| Large Language Models Can Be Easily Distracted by Irrelevant Context | Code | 1 |
| FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models | Code | 1 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Code | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (HPT) | Accuracy | 97.72 | | Unverified |
| 2 | DUP prompt upon GPT-4 | Accuracy | 97.1 | | Unverified |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 96.7 | | Unverified |
| 4 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | Accuracy | 96.4 | | Unverified |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 96 | | Unverified |
| 6 | Jiutian-大模型 | Accuracy | 95.2 | | Unverified |
| 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | Accuracy | 95.1 | | Unverified |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | Accuracy | 95 | | Unverified |
| 9 | OpenMath2-Llama3.1-70B | Accuracy | 94.9 | | Unverified |
| 10 | GPT-4 (Teaching-Inspired) | Accuracy | 94.8 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Text-davinci-002 (175B) (zero-shot-cot) | Accuracy | 78.7 | | Unverified |
| 2 | Text-davinci-002 (175B) (zero-shot) | Accuracy | 17.7 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tree of Thoughts (b=5) | Success | 0.74 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 92.2 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 89.2 | | Unverified |