SOTAVerified

GSM8K

Papers

Showing 1-50 of 439 papers

Title | Status | Hype
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14
Qwen2 Technical Report | Code | 13
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | Code | 9
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression | Code | 9
Qwen2.5-Omni Technical Report | Code | 7
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training | Code | 7
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Code | 6
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct | Code | 5
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models | Code | 5
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B | Code | 5
Common 7B Language Models Already Possess Strong Math Capabilities | Code | 5
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | Code | 4
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | Code | 4
SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights | Code | 4
ReFT: Reasoning with Reinforced Fine-Tuning | Code | 4
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking | Code | 4
Baichuan 2: Open Large-scale Language Models | Code | 4
InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning | Code | 4
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Code | 4
PAL: Program-aided Language Models | Code | 3
Automatic Instruction Evolving for Large Language Models | Code | 3
TokenSkip: Controllable Chain-of-Thought Compression in LLMs | Code | 3
MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning | Code | 3
Thinkless: LLM Learns When to Think | Code | 3
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step | Code | 3
MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline | Code | 3
Training Verifiers to Solve Math Word Problems | Code | 3
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | Code | 3
LoRA-GA: Low-Rank Adaptation with Gradient Approximation | Code | 3
Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution | Code | 3
Scaling up Masked Diffusion Models on Text | Code | 3
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | Code | 3
SkyMath: Technical Report | Code | 3
Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models | Code | 3
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models | Code | 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Code | 3
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems | Code | 2
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2
Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2
Meta Prompting for AI Systems | Code | 2
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2
Natural Language Fine-Tuning | Code | 2
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2
Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models | Code | 2
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Code | 2
Dynamic Early Exit in Reasoning Models | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 | | Unverified
2 | Orange-mini | 0-shot MRR | 98 | | Unverified