SOTAVerified

GSM8K

Papers

Showing 201-225 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems | Code | 0 |
| SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving | Code | 0 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0 |
| SMART: Self-learning Meta-strategy Agent for Reasoning Tasks | Code | 0 |
| AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Code | 0 |
| Text-to-LoRA: Instant Transformer Adaption | Code | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| The Price of Format: Diversity Collapse in LLMs | Code | 0 |
| TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Code | 0 |
| TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students | Code | 0 |
| Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting | Code | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| Iterative Reasoning Preference Optimization | | 0 |
| Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning | | 0 |
| MALT: Improving Reasoning with Multi-Agent LLM Training | | 0 |
| MAmmoTH2: Scaling Instructions from the Web | | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | | 0 |
| Interpretable Math Word Problem Solution Generation Via Step-by-step Planning | | 0 |
| MathAttack: Attacking Large Language Models Towards Math Solving Ability | | 0 |
| Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models | | 0 |
| Instance-adaptive Zero-shot Chain-of-Thought Prompting | | 0 |
| MathDivide: Improved mathematical reasoning by large language models | | 0 |
| TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling | | 0 |
| MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |