SOTAVerified

GSM8K

Papers

Showing 26–50 of 439 papers

| Title | Status | Hype |
|---|---|---|
| Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties | Code | 1 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | | 0 |
| Evaluation of LLMs for mathematical problem solving | | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | | 0 |
| Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | Code | 1 |
| Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation | | 0 |
| Discriminative Policy Optimization for Token-Level Reward Models | Code | 0 |
| CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models | | 0 |
| Maximizing Confidence Alone Improves Reasoning | | 0 |
| LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models | | 0 |
| System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts | | 0 |
| The Price of Format: Diversity Collapse in LLMs | Code | 0 |
| Efficient Data Selection at Scale via Influence Distillation | | 0 |
| Steering LLM Reasoning Through Bias-Only Adaptation | | 0 |
| AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting | Code | 0 |
| PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models | | 0 |
| EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning | Code | 0 |
| Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision | | 0 |
| Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst | | 0 |
| Dual Decomposition of Weights and Singular Value Low Rank Adaptation | | 0 |
| DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models | | 0 |
| Let LLMs Break Free from Overthinking via Self-Braking Tuning | Code | 2 |
| RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs | | 0 |
| Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space | Code | 2 |
| Thinkless: LLM Learns When to Think | Code | 3 |
Page 2 of 18

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |