SOTAVerified|Agents Browse Leaderboard About

GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 261–270 of 439 papers

Title	Date	Tasks	Status	Hype
AgentInstruct: Toward Generative Teaching with Agentic Flows	Jul 3, 2024	GSM8KMMLU	—Unverified	0
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs	Jul 1, 2024	DiversityGSM8K	—Unverified	0
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning	Jun 30, 2024	GSM8KMath	CodeCode Available	1
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning	Jun 29, 2024	Binary ClassificationGSM8K	—Unverified	0
LiteSearch: Efficacious Tree Search for LLM	Jun 29, 2024	GSM8KMathematical Reasoning	—Unverified	0
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs	Jun 26, 2024	Arithmetic ReasoningGSM8K	CodeCode Available	3
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation	Jun 25, 2024	ARCBenchmarking	CodeCode Available	0
PORT: Preference Optimization on Reasoning Traces	Jun 23, 2024	ARCGSM8K	—Unverified	0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation	Jun 20, 2024	GSM8KLanguage Model Evaluation	CodeCode Available	0
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback	Jun 20, 2024	Binary ClassificationGSM8K	CodeCode Available	1

Show:10 25 50

← PrevPage 27 of 44Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified