SOTAVerified|Agents Browse Leaderboard About

GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 231–240 of 439 papers

Title	Date	Tasks	Status	Hype
FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning	Oct 8, 2024	GSM8KHallucination	—Unverified	0
Fine-Grained Self-Endorsement Improves Factuality and Reasoning	Feb 23, 2024	GSM8KLanguage Modeling	—Unverified	0
First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning	Nov 14, 2023	GSM8KMath	—Unverified	0
Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute	Jun 18, 2025	continuous-controlContinuous Control	—Unverified	0
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education	Feb 19, 2025	DiagnosticGSM8K	—Unverified	0
From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting	Dec 18, 2023	DiversityGSM8K	—Unverified	0
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference	Oct 4, 2023	BenchmarkingGPU	—Unverified	0
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems	Jul 17, 2025	DiversityGSM8K	—Unverified	0
GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements	Feb 13, 2024	GSM8KMath	—Unverified	0
Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation	Jun 9, 2025	GSM8KHumanEval	—Unverified	0

Show:10 25 50

← PrevPage 24 of 44Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified