SOTAVerified

Mathematical Reasoning

Papers

Showing 301–325 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers | | 0 |
| AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models | | 0 |
| Let's Reinforce Step by Step | | 0 |
| Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data | | 0 |
| Can Theoretical Physics Research Benefit from Language Agents? | | 0 |
| Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities | | 0 |
| Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation | | 0 |
| Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding | | 0 |
| Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning | | 0 |
| LemmaHead: RAG Assisted Proof Generation Using Large Language Models | | 0 |
| Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning | | 0 |
| Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | | 0 |
| Can LLMs understand Math? -- Exploring the Pitfalls in Mathematical Reasoning | | 0 |
| Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization | | 0 |
| Evolutionary Pre-Prompt Optimization for Mathematical Reasoning | | 0 |
| Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models | | 0 |
| Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability | | 0 |
| Let's reward step by step: Step-Level reward model as the Navigators for Reasoning | | 0 |
| Evaluation of OpenAI o1: Opportunities and Challenges of AGI | | 0 |
| Evaluation of LLMs for mathematical problem solving | | 0 |
| Can Large Language Models Invent Algorithms to Improve Themselves? | | 0 |
| Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering | | 0 |
| Evaluating Robustness of Reward Models for Mathematical Reasoning | | 0 |
| A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions | | 0 |
| Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations | | 0 |

Benchmark Results

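| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | Search-o1 | Acc | 56.7 | | Unverified |
| 6 | s1-32B | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 4 | o1-mini | Accuracy | 0.01 | | Unverified |
| 5 | o1-preview | Accuracy | 0.01 | | Unverified |
| 6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |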