SOTAVerified

Mathematical Reasoning

Papers

Showing 151–175 of 805 papers

| Title | Status | Hype |
|---|---|---|
| Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition | | 0 |
| Scaling Reasoning can Improve Factuality in Large Language Models | Code | 0 |
| Group-in-Group Policy Optimization for LLM Agent Training | Code | 5 |
| Reasoning on a Budget: Miniaturizing DeepSeek R1 with SFT-GRPO Alignment for Instruction-Tuned LLMs | Code | 1 |
| MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning | Code | 3 |
| Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations? | | 0 |
| ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention | Code | 0 |
| DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models | Code | 1 |
| Qwen3 Technical Report | Code | 14 |
| Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation | | 0 |
| Agent-as-a-Service based on Agent Network | | 0 |
| Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving | Code | 2 |
| Assessing Robustness to Spurious Correlations in Post-Training Language Models | | 0 |
| Crosslingual Reasoning through Test-Time Scaling | Code | 1 |
| Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey | | 0 |
| Absolute Zero: Reinforced Self-play Reasoning with Zero Data | Code | 11 |
| Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL | Code | 1 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2 |
| Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | Code | 1 |
| DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition | Code | 5 |
| RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library | | 0 |
| Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think | Code | 0 |
| Reinforcement Learning for Reasoning in Large Language Models with One Training Example | Code | 3 |
| Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | | 0 |
| Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models | Code | 0 |

Benchmark Results

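| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |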