SOTAVerified

GSM8K

Papers

Showing 26–50 of 439 papers

Title | Status | Hype
LoRA-GA: Low-Rank Adaptation with Gradient Approximation | Code | 3
TokenSkip: Controllable Chain-of-Thought Compression in LLMs | Code | 3
PAL: Program-aided Language Models | Code | 3
Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution | Code | 3
Training Verifiers to Solve Math Word Problems | Code | 3
Scaling up Masked Diffusion Models on Text | Code | 3
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | Code | 3
SkyMath: Technical Report | Code | 3
Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models | Code | 3
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models | Code | 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Code | 3
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems | Code | 2
Natural Language Fine-Tuning | Code | 2
Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | Code | 2
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2
Meta Prompting for AI Systems | Code | 2
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models | Code | 2
Let LLMs Break Free from Overthinking via Self-Braking Tuning | Code | 2
Page 2 of 18

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 | | Unverified
2 | Orange-mini | 0-shot MRR | 98 | | Unverified