SOTAVerified

GSM8K

Papers

Showing 51–75 of 439 papers

Title | Status | Hype
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models | Code | 2
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | Code | 2
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2
Preference Optimization for Reasoning with Pseudo Feedback | Code | 2
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2
ProcessBench: Identifying Process Errors in Mathematical Reasoning | Code | 2
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems | Code | 2
Progressive-Hint Prompting Improves Reasoning in Large Language Models | Code | 2
Reformatted Alignment | Code | 2
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2
any4: Learned 4-bit Numeric Representation for LLMs | Code | 2
Natural Language Fine-Tuning | Code | 2
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Code | 2
Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | Code | 2
Meta Prompting for AI Systems | Code | 2
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Code | 2
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models | Code | 2
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Code | 2
Dynamic Early Exit in Reasoning Models | Code | 2
Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Code | 2
CoT-Valve: Length-Compressible Chain-of-Thought Tuning | Code | 2
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 | | Unverified
2 | Orange-mini | 0-shot MRR | 98 | | Unverified