| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Let's Verify Math Questions Step by Step | May 20, 2025 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Learning From Mistakes Makes LLM Better Reasoner | Oct 31, 2023 | GSM8K, Math | Code Available | 1 | 5 |
| ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | May 27, 2024 | Code Generation, HumanEval | Code Available | 1 | 5 |
| Augmenting Math Word Problems via Iterative Question Composing | Jan 17, 2024 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Learning Multi-Step Reasoning by Solving Arithmetic Tasks | Jun 2, 2023 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification | Jun 5, 2025 | Automated Theorem Proving, Hallucination | Code Available | 1 | 5 |
| GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models | Apr 13, 2025 | Mathematical Reasoning | Code Available | 1 | 5 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 | 5 |
| Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents | Feb 18, 2024 | Mathematical Reasoning, Multi-hop Question Answering | Code Available | 1 | 5 |
| Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | May 20, 2025 | Instruction Following, Mathematical Reasoning | Code Available | 1 | 5 |
| GOLD: Geometry Problem Solver with Natural Language Description | May 1, 2024 | Math | Code Available | 1 | 5 |
| CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Oct 14, 2024 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | May 30, 2021 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning | May 22, 2025 | Mathematical Reasoning, reinforcement-learning | Code Available | 1 | 5 |
| Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Apr 9, 2025 | Logical Reasoning, Mathematical Reasoning | Code Available | 1 | 5 |
| R-PRM: Reasoning-Driven Process Reward Modeling | Mar 27, 2025 | Mathematical Reasoning | Code Available | 1 | 5 |
| Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning | Sep 19, 2024 | Form, Instruction Following | Code Available | 1 | 5 |
| Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | May 10, 2021 | Arithmetic Reasoning, Geometry Problem Solving | Code Available | 1 | 5 |
| JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | May 23, 2024 | Knowledge Distillation, Math | Code Available | 1 | 5 |
| Lila: A Unified Benchmark for Mathematical Reasoning | Oct 31, 2022 | Diversity, Mathematical Reasoning | Code Available | 1 | 5 |
| Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | May 5, 2025 | Code Generation, GSM8K | Code Available | 1 | 5 |
| Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Jan 17, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 | 5 |
| FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models | Jul 1, 2024 | Mathematical Reasoning | Code Available | 0 | 5 |
| A Survey on Mathematical Reasoning and Optimization with Large Language Models | Mar 22, 2025 | Automated Theorem Proving, Heuristic Search | Code Available | 0 | 5 |
| Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning | Jun 22, 2021 | Deep Learning, Logical Reasoning | Code Available | 0 | 5 |
| Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 | Sep 9, 2024 | Abstract Algebra, Automated Theorem Proving | Code Available | 0 | 5 |
| Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models | Mar 27, 2025 | Data Visualization, Math | Code Available | 0 | 5 |
| PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment | Nov 18, 2024 | Mathematical Reasoning | Code Available | 0 | 5 |
| CER: Confidence Enhanced Reasoning in LLMs | Feb 20, 2025 | Math, Mathematical Reasoning | Code Available | 0 | 5 |
| Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | May 29, 2025 | Mathematical Reasoning | Code Available | 0 | 5 |
| AI-Assisted Generation of Difficult Math Questions | Jul 30, 2024 | Math, Mathematical Reasoning | Code Available | 0 | 5 |
| Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models | Nov 19, 2024 | Mathematical Reasoning | Code Available | 0 | 5 |
| A Survey of Deep Learning for Geometry Problem Solving | Jul 16, 2025 | Deep Learning, Geometry Problem Solving | Code Available | 0 | 5 |
| Process-based Self-Rewarding Language Models | Mar 5, 2025 | Mathematical Reasoning | Code Available | 0 | 5 |
| Planning and Editing What You Retrieve for Enhanced Tool Learning | Mar 30, 2024 | Mathematical Reasoning, Retrieval | Code Available | 0 | 5 |
| Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting | Feb 9, 2023 | Mathematical Reasoning, Natural Language Inference | Code Available | 0 | 5 |
| Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Oct 6, 2024 | Mathematical Reasoning, Spatial Reasoning | Code Available | 0 | 5 |
| Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Feb 18, 2024 | Mathematical Reasoning, Text Generation | Code Available | 0 | 5 |
| Can LLMs Solve longer Math Word Problems Better? | May 23, 2024 | Data Augmentation, Math | Code Available | 0 | 5 |
| Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic | Nov 3, 2022 | Arithmetic Reasoning, Language Modeling | Code Available | 0 | 5 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Apr 17, 2024 | Form, Language Model Evaluation | Code Available | 0 | 5 |
| Reasoning over Uncertain Text by Generative Large Language Models | Feb 14, 2024 | Decision Making, Mathematical Reasoning | Code Available | 0 | 5 |
| Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Mar 30, 2024 | Math, Mathematical Problem-Solving | Code Available | 0 | 5 |
| On-Policy RL with Optimal Reward Baseline | May 29, 2025 | Large Language Model, Mathematical Reasoning | Code Available | 0 | 5 |
| Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Jun 11, 2025 | Mathematical Reasoning | Code Available | 0 | 5 |
| Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction | Jun 2, 2024 | Mathematical Reasoning | Code Available | 0 | 5 |
| Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Oct 16, 2024 | All, GSM8K | Code Available | 0 | 5 |
| NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models | Jun 5, 2024 | Math, Mathematical Reasoning | Code Available | 0 | 5 |
| Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | May 26, 2025 | Hallucination, Math | Code Available | 0 | 5 |