SOTAVerified

GSM8K

Papers

Showing 326–350 of 439 papers

| Title | Status | Hype |
|---|---|---|
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 0 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | | 0 |
| STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning | | 0 |
| Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation | | 0 |
| Building Math Agents with Multi-Turn Iterative Preference Learning | | 0 |
| Prompt Baking | | 0 |
| S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners | | 0 |
| Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic | | 0 |
| Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems | | 0 |
| SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models | | 0 |
| Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs | | 0 |
| SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | | 0 |
| Cool-Fusion: Fuse Large Language Models without Training | | 0 |
| Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost | | 0 |
| Reliable Reasoning Beyond Natural Language | | 0 |
| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | | 0 |
| Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | | 0 |
| When is the consistent prediction likely to be a correct prediction? | | 0 |
| Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks | | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | | 0 |
| Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | | 0 |
| Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | | 0 |
| LiteSearch: Efficacious Tree Search for LLM | | 0 |
Page 14 of 18

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |