SOTAVerified|Agents Browse Leaderboard About

Arithmetic Reasoning

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 71–80 of 175 papers

Title	Date	Tasks	Status	Hype
An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning	Feb 23, 2024	Arithmetic ReasoningAutomated Theorem Proving	CodeCode Available	2
Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation	Feb 21, 2024	Arithmetic ReasoningGSM8K	CodeCode Available	1
SymBa: Symbolic Backward Chaining for Structured Natural Language Reasoning	Feb 20, 2024	Arithmetic ReasoningGSM8K	—Unverified	0
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering	Feb 17, 2024	Arithmetic ReasoningMathematical Reasoning	—Unverified	0
Orca-Math: Unlocking the potential of SLMs in Grade School Math	Feb 16, 2024	Arithmetic ReasoningGSM8K	—Unverified	0
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset	Feb 15, 2024	Arithmetic ReasoningGSM8K	CodeCode Available	4
The Unreasonable Effectiveness of Eccentric Automatic Prompts	Feb 9, 2024	Arithmetic ReasoningGSM8K	—Unverified	0
Exploring Group and Symmetry Principles in Large Language Models	Feb 9, 2024	Arithmetic ReasoningNegation	—Unverified	0
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models	Feb 5, 2024	Arithmetic ReasoningMath	CodeCode Available	9
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting	Jan 28, 2024	Arithmetic ReasoningFact Checking	—Unverified	0

Show:10 25 50

← PrevPage 8 of 18Next →

All datasets GSM8K MultiArith Game of 24 MathMC MathToF

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Claude 3.5 Sonnet (HPT)	Accuracy	97.72	—	Unverified
2	DUP prompt upon GPT-4	Accuracy	97.1	—	Unverified
3	Qwen2-Math-72B-Instruct (greedy)	Accuracy	96.7	—	Unverified
4	SFT-Mistral-7B (Metamath, OVM, Smart Ensemble)	Accuracy	96.4	—	Unverified
5	OpenMath2-Llama3.1-70B (majority@256)	Accuracy	96	—	Unverified
6	Jiutian-大模型	Accuracy	95.2	—	Unverified
7	DAMOMath-7B(MetaMath, OVM, BS, Ensemble)	Accuracy	95.1	—	Unverified
8	Claude 3 Opus (0-shot chain-of-thought)	Accuracy	95	—	Unverified
9	OpenMath2-Llama3.1-70B	Accuracy	94.9	—	Unverified
10	GPT-4 (Teaching-Inspired)	Accuracy	94.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Text-davinci-002 (175B)(zero-shot-cot)	Accuracy	78.7	—	Unverified
2	Text-davinci-002 (175B) (zero-shot)	Accuracy	17.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Tree of Thoughts (b=5)	Success	0.74	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 (Teaching-Inspired)	Accuracy	92.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 (Teaching-Inspired)	Accuracy	89.2	—	Unverified