SOTAVerified

Mathematical Reasoning

Papers

Showing 501-550 of 805 papers

Title | Status | Hype
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision | | 0
Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection | | 0
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction | | 0
Enhancing Mathematical Reasoning in LLMs with Background Operators | | 0
Enhancing Neural Mathematical Reasoning by Abductive Combination with Symbolic Library | | 0
Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search | | 0
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles | | 0
Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning | | 0
Entropy-Aware Branching for Improved Mathematical Reasoning | | 0
Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework | | 0
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | | 0
Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics | | 0
Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads | | 0
Evaluating Robustness of Reward Models for Mathematical Reasoning | | 0
Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering | | 0
Evaluation of LLMs for mathematical problem solving | | 0
Evaluation of OpenAI o1: Opportunities and Challenges of AGI | | 0
Evolutionary Pre-Prompt Optimization for Mathematical Reasoning | | 0
Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization | | 0
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | | 0
Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning | | 0
Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding | | 0
Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation | | 0
Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data | | 0
Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions | | 0
Exploring the Mystery of Influential Data for Mathematical Reasoning | | 0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | | 0
Federated Prompting and Chain-of-Thought Reasoning for Improving LLMs Answering | | 0
FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning | | 0
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models | | 0
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together | | 0
First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning | | 0
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning | | 0
Foreword: A Computable Universe, Understanding Computation and Exploring Nature As Computation | | 0
Formal Mathematical Reasoning: A New Frontier in AI | | 0
Fourier Circuits in Neural Networks and Transformers: A Case Study of Modular Arithmetic with Multiple Inputs | | 0
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks | | 0
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education | | 0
From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting | | 0
From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs | | 0
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI | | 0
Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning | | 0
GAPS: Geometry-Aware Problem Solver | | 0
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | | 0
GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning | | 0
GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks | | 0
GoRA: Gradient-driven Adaptive Low Rank Adaptation | | 0
GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning | | 0
GraphMR: Graph Neural Network for Mathematical Reasoning | | 0
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | s1-32B | Acc | 56.7 | | Unverified
6 | Search-o1 | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | o1-preview | Accuracy | 0.01 | | Unverified
4 | GPT-4o | Accuracy | 0.01 | | Unverified
5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified
6 | o1-mini | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified