SOTAVerified

GSM8K

Papers

Showing 51–75 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2 |
| ProcessBench: Identifying Process Errors in Mathematical Reasoning | Code | 2 |
| How to Correctly do Semantic Backpropagation on Language-based Agentic Systems | Code | 2 |
| Preference Optimization for Reasoning with Pseudo Feedback | Code | 2 |
| Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | Code | 2 |
| Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Code | 2 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2 |
| VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | Code | 2 |
| Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Code | 2 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2 |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2 |
| Weak-to-Strong Reasoning | Code | 2 |
| LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2 |
| MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark | Code | 2 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2 |
| Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards | Code | 2 |
| LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Code | 2 |
| GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2 |
| Reformatted Alignment | Code | 2 |
| Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2 |
| Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning | Code | 2 |
| SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese | Code | 2 |
| Meta Prompting for AI Systems | Code | 2 |
| Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch | Code | 2 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |