SOTAVerified

Mathematical Reasoning

Papers

Showing 676-700 of 805 papers

| Title | Status | Hype |
|---|---|---|
| OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0 |
| Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation | | 0 |
| Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models | | 0 |
| Orca 2: Teaching Small Language Models How to Reason | | 0 |
| OSoRA: Output-Dimension and Singular-Value Initialized Low-Rank Adaptation | | 0 |
| PARAMANU-GANITA: Language Model with Mathematical Capabilities | | 0 |
| Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging | | 0 |
| Path-Consistency: Prefix Enhancement for Efficient Inference in LLM | | 0 |
| Path Planning for Masked Diffusion Model Sampling | | 0 |
| Pensez: Less Data, Better Reasoning -- Rethinking French LLM | | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | | 0 |
| Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information | | 0 |
| Plug-and-Play Training Framework for Preference Optimization | | 0 |
| Policy Guided Tree Search for Enhanced LLM Reasoning | | 0 |
| PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts | | 0 |
| PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness | | 0 |
| PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models | | 0 |
| Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs | | 0 |
| PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models | | 0 |
| Pre-trained Large Language Models Use Fourier Features to Compute Addition | | 0 |
| Probabilistic Results on the Architecture of Mathematical Reasoning Aligned by Cognitive Alternation | | 0 |
| Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models | | 0 |
| Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps | | 0 |
| Progress or Regress? Self-Improvement Reversal in Post-training | | 0 |
| Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control | | 0 |

Benchmark Results

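| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |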