SOTAVerified

Mathematical Reasoning

Papers

Showing 251-300 of 805 papers

Title | Status | Hype
Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models | Code | 1
Lila: A Unified Benchmark for Mathematical Reasoning | Code | 1
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? | Code | 1
GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models | Code | 1
Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents | Code | 1
Learning From Mistakes Makes LLM Better Reasoner | Code | 1
Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch | Code | 1
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Code | 1
R-PRM: Reasoning-Driven Process Reward Modeling | Code | 1
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification | Code | 1
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | Code | 1
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Code | 1
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning | Code | 1
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning | Code | 1
JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | Code | 1
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | Code | 1
MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization | Code | 1
Large Language Models for Multi-Robot Systems: A Survey | Code | 1
LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning | Code | 1
GOLD: Geometry Problem Solver with Natural Language Description | Code | 1
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | Code | 1
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation | Code | 1
UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression | Code | 1
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models | Code | 0
A Survey on Mathematical Reasoning and Optimization with Large Language Models | Code | 0
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning | Code | 0
Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 | Code | 0
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning | Code | 0
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models | Code | 0
Reverse Operation based Data Augmentation for Solving Math Word Problems | Code | 0
CER: Confidence Enhanced Reasoning in LLMs | Code | 0
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment | Code | 0
AI-Assisted Generation of Difficult Math Questions | Code | 0
A Survey of Deep Learning for Geometry Problem Solving | Code | 0
Reasoning over Uncertain Text by Generative Large Language Models | Code | 0
Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting | Code | 0
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Code | 0
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Code | 0
Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0
Process-based Self-Rewarding Language Models | Code | 0
Can LLMs Solve longer Math Word Problems Better? | Code | 0
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction | Code | 0
Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Code | 0
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0
NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors | Code | 0
Can A Gamer Train A Mathematical Reasoning Model? | Code | 0
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | Search-o1 | Acc | 56.7 | | Unverified
6 | s1-32B | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | GPT-4o | Accuracy | 0.01 | | Unverified
4 | o1-mini | Accuracy | 0.01 | | Unverified
5 | o1-preview | Accuracy | 0.01 | | Unverified
6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified