SOTAVerified

GSM8K

Papers

Showing 151-175 of 439 papers

Title | Status | Hype
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Code | 1
Large Language Models as Optimizers | Code | 1
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes | Code | 1
Over-Reasoning and Redundant Calculation of Large Language Models | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1
Fine-Grained Self-Endorsement Improves Factuality and Reasoning | - | 0
FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning | - | 0
CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs | - | 0
Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | - | 0
Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty | - | 0
A Careful Examination of Large Language Model Performance on Grade School Arithmetic | - | 0
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree | - | 0
Cool-Fusion: Fuse Large Language Models without Training | - | 0
Automatic Prompt Selection for Large Language Models | - | 0
ControlMath: Controllable Data Generation Promotes Math Generalist Models | - | 0
Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications | - | 0
Exploring an LM to generate Prolog Predicates from Mathematics Questions | - | 0
Explicit Knowledge Transfer for Weakly-Supervised Code Generation | - | 0
Contrastive Decoding Improves Reasoning in Large Language Models | - | 0
Excessive Reasoning Attack on Reasoning LLMs | - | 0
Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization | - | 0
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost | - | 0
Evolutionary Pre-Prompt Optimization for Mathematical Reasoning | - | 0
Evaluation of LLMs for mathematical problem solving | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 | - | Unverified
2 | Orange-mini | 0-shot MRR | 98 | - | Unverified