SOTAVerified

GSM8K

Papers

Showing 226–250 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Code | 2 |
| LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning | Code | 0 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 0 |
| Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent | Code | 1 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | | 0 |
| STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning | | 0 |
| Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation | | 0 |
| Prompt Baking | | 0 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2 |
| Building Math Agents with Multi-Turn Iterative Preference Learning | | 0 |
| S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners | | 0 |
| Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems | | 0 |
| Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic | | 0 |
| SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models | | 0 |
| SORSA: Singular Values and Orthonormal Regularized Singular Vectors Adaptation of Large Language Models | Code | 1 |
| Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs | | 0 |
| SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | | 0 |
| Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Code | 4 |
| Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula | Code | 1 |
| Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | Code | 3 |
| Cool-Fusion: Fuse Large Language Models without Training | | 0 |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2 |
| Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost | | 0 |
| Learning Goal-Conditioned Representations for Language Reward Models | Code | 1 |
| Weak-to-Strong Reasoning | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |