SOTAVerified|Agents Browse Leaderboard About

GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 426–439 of 439 papers

Title	Date	Tasks	Status	Hype
TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation	Feb 19, 2025	Dataset GenerationGSM8K	CodeCode Available	0
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students	May 2, 2025	GSM8KIn-Context Learning	CodeCode Available	0
How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective	Oct 14, 2024	Density Ratio EstimationGSM8K	CodeCode Available	0
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment	May 30, 2024	GSM8KKnowledge Distillation	CodeCode Available	0
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls	Feb 16, 2025	Computational EfficiencyGSM8K	CodeCode Available	0
Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems	Oct 3, 2023	GSM8KMath	CodeCode Available	0
DIVE: Diversified Iterative Self-Improvement	Jan 1, 2025	DiversityGSM8K	CodeCode Available	0
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving	Jan 14, 2025	GSM8KMath	CodeCode Available	0
Exploring LLM Reasoning Through Controlled Prompt Variations	Apr 2, 2025	GSM8KMathematical Problem-Solving	CodeCode Available	0
Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning	Aug 21, 2023	GSM8K	CodeCode Available	0
Distilling Reasoning Capabilities into Smaller Language Models	Dec 1, 2022	GSM8KKnowledge Distillation	CodeCode Available	0
AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting	May 24, 2025	GSM8KReinforcement Learning (RL)	CodeCode Available	0
Discriminative Policy Optimization for Token-Level Reward Models	May 29, 2025	GSM8KLanguage Modeling	CodeCode Available	0
DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory	Jan 11, 2025	GSM8KQuantization	CodeCode Available	0

Show:10 25 50

← PrevPage 18 of 18Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified