SOTAVerified

Mathematical Reasoning

Papers

Showing 576–600 of 805 papers

| Title | Status | Hype |
|---|---|---|
| Aligning Tutor Discourse Supporting Rigorous Thinking with Tutee Content Mastery for Predicting Math Achievement | | 0 |
| LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought | | 0 |
| VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context | Code | 1 |
| AlphaMath Almost Zero: Process Supervision without Process | Code | 3 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2 |
| GOLD: Geometry Problem Solver with Natural Language Description | Code | 1 |
| A Careful Examination of Large Language Model Performance on Grade School Arithmetic | | 0 |
| Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions | | 0 |
| Benchmarking Benchmark Leakage in Large Language Models | Code | 2 |
| Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training | | 0 |
| PARAMANU-GANITA: Language Model with Mathematical Capabilities | | 0 |
| Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models | Code | 0 |
| iTBLS: A Dataset of Interactive Conversations Over Tabular Information | | 0 |
| Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | Code | 1 |
| Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory | | 0 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0 |
| Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards | Code | 2 |
| Compression Represents Intelligence Linearly | Code | 2 |
| Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition | Code | 0 |
| Evaluating Mathematical Reasoning Beyond Accuracy | Code | 2 |
| SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language Models | | 0 |
| Exploring the Mystery of Influential Data for Mathematical Reasoning | | 0 |
| Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0 |
| Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0 |
| Dual Instruction Tuning with Large Language Models for Mathematical Reasoning | | 0 |

Benchmark Results

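| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | Search-o1 | Acc | 56.7 | | Unverified |
| 6 | s1-32B | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 4 | o1-mini | Accuracy | 0.01 | | Unverified |
| 5 | o1-preview | Accuracy | 0.01 | | Unverified |
| 6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |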