SOTAVerified|Agents Browse Leaderboard About

GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 421–430 of 439 papers

Title	Date	Tasks	Status	Hype
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation	Jun 25, 2024	ARCBenchmarking	CodeCode Available	0
Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving	Dec 20, 2024	Computational EfficiencyGSM8K	CodeCode Available	0
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective	Feb 20, 2025	GSM8KMath	CodeCode Available	0
In-Context Principle Learning from Mistakes	Feb 8, 2024	GSM8KIn-Context Learning	CodeCode Available	0
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations	Nov 22, 2023	Common Sense ReasoningGSM8K	CodeCode Available	0
TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation	Feb 19, 2025	Dataset GenerationGSM8K	CodeCode Available	0
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students	May 2, 2025	GSM8KIn-Context Learning	CodeCode Available	0
How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective	Oct 14, 2024	Density Ratio EstimationGSM8K	CodeCode Available	0
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment	May 30, 2024	GSM8KKnowledge Distillation	CodeCode Available	0
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls	Feb 16, 2025	Computational EfficiencyGSM8K	CodeCode Available	0

Show:10 25 50

← PrevPage 43 of 44Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified