SOTAVerified

Mathematical Reasoning

Papers

Showing 326–350 of 805 papers

Title | Status | Hype
Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads | | 0
Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments | | 0
LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought | | 0
LLMs can implicitly learn from mistakes in-context | | 0
Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics | | 0
A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting | | 0
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | | 0
Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework | | 0
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering | | 0
LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement | | 0
Assessing GPT4-V on Structured Reasoning Tasks | | 0
Building Math Agents with Multi-Turn Iterative Preference Learning | | 0
LLM Library Learning Fails: A LEGO-Prover Case Study | | 0
Entropy-Aware Branching for Improved Mathematical Reasoning | | 0
LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning | | 0
LLMs can be easily Confused by Instructional Distractions | | 0
DavIR: Data Selection via Implicit Reward for Large Language Models | | 0
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning | | 0
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles | | 0
Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search | | 0
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models | | 0
Enhancing Neural Mathematical Reasoning by Abductive Combination with Symbolic Library | | 0
Enhancing Mathematical Reasoning in LLMs with Background Operators | | 0
Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations? | | 0
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | s1-32B | Acc | 56.7 | | Unverified
6 | Search-o1 | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | o1-preview | Accuracy | 0.01 | | Unverified
4 | GPT-4o | Accuracy | 0.01 | | Unverified
5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified
6 | o1-mini | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified