SOTAVerified

GSM8K

Papers

Showing 201–250 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems | Code | 0 |
| SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving | Code | 0 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0 |
| SMART: Self-learning Meta-strategy Agent for Reasoning Tasks | Code | 0 |
| AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Code | 0 |
| Text-to-LoRA: Instant Transformer Adaption | Code | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| The Price of Format: Diversity Collapse in LLMs | Code | 0 |
| TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Code | 0 |
| TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students | Code | 0 |
| Unsupervised Elicitation of Language Models | Code | 0 |
| Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting | Code | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| Transcending Scaling Laws with 0.1% Extra Compute | | 0 |
| MALT: Improving Reasoning with Multi-Agent LLM Training | | 0 |
| MAmmoTH2: Scaling Instructions from the Web | | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | | 0 |
| Interpretable Math Word Problem Solution Generation Via Step-by-step Planning | | 0 |
| MathAttack: Attacking Large Language Models Towards Math Solving Ability | | 0 |
| Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models | | 0 |
| Instance-adaptive Zero-shot Chain-of-Thought Prompting | | 0 |
| MathDivide: Improved mathematical reasoning by large language models | | 0 |
| Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning | | 0 |
| MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | | 0 |
| MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs | | 0 |
| InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion | | 0 |
| TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling | | 0 |
| Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification | | 0 |
| Maximizing Confidence Alone Improves Reasoning | | 0 |
| Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients | | 0 |
| Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving | | 0 |
| Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach | | 0 |
| Improve Mathematical Reasoning in Language Models by Automated Process Supervision | | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | | 0 |
| MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time | | 0 |
| Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | | 0 |
| Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | | 0 |
| Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization | | 0 |
| Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation | | 0 |
| Multi-Reference Preference Optimization for Large Language Models | | 0 |
| Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision | | 0 |
| GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements | | 0 |
| GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems | | 0 |
| From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference | | 0 |
| From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting | | 0 |
| From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education | | 0 |
| Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute | | 0 |
| First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |