SOTAVerified

Mathematical Reasoning

Papers

Showing 726–750 of 805 papers

| Title | Status | Hype |
|---|---|---|
| Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments | | 0 |
| Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities | | 0 |
| GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning | | 0 |
| From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting | | 0 |
| TinyGSM: achieving >80% on GSM8k with small language models | | 0 |
| Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | | 0 |
| Assessing GPT4-V on Structured Reasoning Tasks | | 0 |
| Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning | Code | 0 |
| Universal Self-Consistency for Large Language Model Generation | | 0 |
| LANS: A Layout-Aware Neural Solver for Plane Geometry Problem | | 0 |
| AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Code | 0 |
| Orca 2: Teaching Small Language Models How to Reason | | 0 |
| First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning | | 0 |
| VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency | Code | 0 |
| Let's Reinforce Step by Step | | 0 |
| ATHENA: Mathematical Reasoning with Thought Expansion | Code | 0 |
| math-PVS: A Large Language Model Framework to Map Scientific Publications to PVS Theories | | 0 |
| MCC-KD: Multi-CoT Consistent Knowledge Distillation | Code | 0 |
| MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models | Code | 0 |
| Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations | | 0 |
| TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models | Code | 0 |
| Let's reward step by step: Step-Level reward model as the Navigators for Reasoning | | 0 |
| DavIR: Data Selection via Implicit Reward for Large Language Models | | 0 |
| KwaiYiiMath: Technical Report | | 0 |
| LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation | | 0 |

Benchmark Results

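| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | Search-o1 | Acc | 56.7 | | Unverified |
| 6 | s1-32B | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 4 | o1-mini | Accuracy | 0.01 | | Unverified |
| 5 | o1-preview | Accuracy | 0.01 | | Unverified |
| 6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |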