SOTAVerified

Mathematical Reasoning

Papers

Showing 251-300 of 805 papers

Title | Status | Hype
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Code | 1
Let's Verify Math Questions Step by Step | Code | 1
Learning From Mistakes Makes LLM Better Reasoner | Code | 1
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | Code | 1
Augmenting Math Word Problems via Iterative Question Composing | Code | 1
Learning Multi-Step Reasoning by Solving Arithmetic Tasks | Code | 1
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification | Code | 1
GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models | Code | 1
Large Language Models for Multi-Robot Systems: A Survey | Code | 1
Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents | Code | 1
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | Code | 1
GOLD: Geometry Problem Solver with Natural Language Description | Code | 1
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Code | 1
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | Code | 1
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning | Code | 1
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Code | 1
R-PRM: Reasoning-Driven Process Reward Modeling | Code | 1
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning | Code | 1
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | Code | 1
JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | Code | 1
Lila: A Unified Benchmark for Mathematical Reasoning | Code | 1
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | Code | 1
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models | Code | 0
A Survey on Mathematical Reasoning and Optimization with Large Language Models | Code | 0
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning | Code | 0
Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 | Code | 0
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models | Code | 0
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment | Code | 0
CER: Confidence Enhanced Reasoning in LLMs | Code | 0
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Code | 0
AI-Assisted Generation of Difficult Math Questions | Code | 0
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models | Code | 0
A Survey of Deep Learning for Geometry Problem Solving | Code | 0
Process-based Self-Rewarding Language Models | Code | 0
Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0
Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting | Code | 0
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Code | 0
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0
Can LLMs Solve longer Math Word Problems Better? | Code | 0
Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Reasoning over Uncertain Text by Generative Large Language Models | Code | 0
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0
On-Policy RL with Optimal Reward Baseline | Code | 0
Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Code | 0
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction | Code | 0
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0
NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models | Code | 0
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | s1-32B | Acc | 56.7 | | Unverified
6 | Search-o1 | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | o1-preview | Accuracy | 0.01 | | Unverified
4 | GPT-4o | Accuracy | 0.01 | | Unverified
5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified
6 | o1-mini | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified