SOTAVerified

GSM8K

Papers

Showing 51–100 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2 |
| ProcessBench: Identifying Process Errors in Mathematical Reasoning | Code | 2 |
| How to Correctly do Semantic Backpropagation on Language-based Agentic Systems | Code | 2 |
| Preference Optimization for Reasoning with Pseudo Feedback | Code | 2 |
| Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | Code | 2 |
| Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Code | 2 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2 |
| VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | Code | 2 |
| Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Code | 2 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2 |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2 |
| Weak-to-Strong Reasoning | Code | 2 |
| LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2 |
| MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark | Code | 2 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2 |
| Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards | Code | 2 |
| LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Code | 2 |
| GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2 |
| Reformatted Alignment | Code | 2 |
| Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2 |
| Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning | Code | 2 |
| SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese | Code | 2 |
| Meta Prompting for AI Systems | Code | 2 |
| Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch | Code | 2 |
| MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Code | 2 |
| MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models | Code | 2 |
| Scaling Relationship on Learning Mathematical Reasoning with Large Language Models | Code | 2 |
| Progressive-Hint Prompting Improves Reasoning in Large Language Models | Code | 2 |
| Language Models are Multilingual Chain-of-Thought Reasoners | Code | 2 |
| Large Language Models are Zero-Shot Reasoners | Code | 2 |
| IRanker: Towards Ranking Foundation Model | Code | 1 |
| CommVQ: Commutative Vector Quantization for KV Cache Compression | Code | 1 |
| Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team | Code | 1 |
| Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties | Code | 1 |
| Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | Code | 1 |
| Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning | Code | 1 |
| Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | Code | 1 |
| NeMo-Inspector: A Visualization Tool for LLM Generation Analysis | Code | 1 |
| Efficient Reasoning for LLMs through Speculative Chain-of-Thought | Code | 1 |
| Large (Vision) Language Models are Unsupervised In-Context Learners | Code | 1 |
| Entropy-Based Adaptive Weighting for Self-Training | Code | 1 |
| SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging | Code | 1 |
| Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models | Code | 1 |
| PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models | Code | 1 |
| FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Code | 1 |
| Self-Training Elicits Concise Reasoning in Large Language Models | Code | 1 |
| SMART: Self-Aware Agent for Tool Overuse Mitigation | Code | 1 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1 |
| Entropy-Regularized Process Reward Model | Code | 1 |
Page 2 of 9

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |