SOTAVerified

Mathematical Reasoning

Papers

Showing 276-300 of 805 papers

Title | Status | Hype
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning | Code | 0
Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 | Code | 0
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning | Code | 0
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models | Code | 0
Reverse Operation based Data Augmentation for Solving Math Word Problems | Code | 0
CER: Confidence Enhanced Reasoning in LLMs | Code | 0
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment | Code | 0
AI-Assisted Generation of Difficult Math Questions | Code | 0
A Survey of Deep Learning for Geometry Problem Solving | Code | 0
Reasoning over Uncertain Text by Generative Large Language Models | Code | 0
Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting | Code | 0
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Code | 0
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Code | 0
Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0
Process-based Self-Rewarding Language Models | Code | 0
Can LLMs Solve longer Math Word Problems Better? | Code | 0
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction | Code | 0
Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Code | 0
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0
NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors | Code | 0
Can A Gamer Train A Mathematical Reasoning Model? | Code | 0
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | Search-o1 | Acc | 56.7 | | Unverified
6 | s1-32B | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | GPT-4o | Accuracy | 0.01 | | Unverified
4 | o1-mini | Accuracy | 0.01 | | Unverified
5 | o1-preview | Accuracy | 0.01 | | Unverified
6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified