| Toward Adaptive Reasoning in Large Language Models with Thought Rollback | Jul 21, 2024 | Arithmetic Reasoning, Math | Code Available | 1 |
| Building Dataset for Grounding of Formulae — Annotating Coreference Relations Among Math Identifiers | Jun 1, 2022 | Math | Code Available | 1 |
| Towards an AI to Win Ghana's National Science and Maths Quiz | Aug 8, 2023 | Math, Question Answering | Code Available | 1 |
| Large Language Models Are Neurosymbolic Reasoners | Jan 17, 2024 | Common Sense Reasoning, Math | Code Available | 1 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code Completion, Math | Code Available | 1 |
| Entropy-Based Adaptive Weighting for Self-Training | Mar 31, 2025 | GSM8K, Math | Code Available | 1 |
| MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Dec 28, 2023 | GSM8K, Language Model Evaluation | Code Available | 1 |
| How well do Large Language Models perform in Arithmetic tasks? | Mar 16, 2023 | Math | Code Available | 1 |
| Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning | May 30, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| HARP: A challenging human-annotated math reasoning benchmark | Dec 11, 2024 | Math | Code Available | 1 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Apr 23, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Can an AI Win Ghana's National Science and Maths Quiz? An AI Grand Challenge for Education | Jan 30, 2023 | Math, Position | Code Available | 1 |
| Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models | Jun 4, 2025 | Math | Code Available | 1 |
| Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning | Jan 19, 2024 | GSM8K, Math | Code Available | 1 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | Chatbot, Language Modeling | Code Available | 1 |
| Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning | Jun 4, 2023 | Math | Code Available | 1 |
| Implicit Chain of Thought Reasoning via Knowledge Distillation | Nov 2, 2023 | Knowledge Distillation, Math | Code Available | 1 |
| Are NLP Models really able to Solve Simple Math Word Problems? | Mar 12, 2021 | Math, Math Word Problem Solving | Code Available | 1 |
| Case-Based or Rule-Based: How Do Transformers Do the Math? | Feb 27, 2024 | Math, Systematic Generalization | Code Available | 1 |
| Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem | Apr 7, 2020 | Decoder, Machine Translation | Code Available | 1 |
| CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models | Dec 23, 2024 | Decision Making, Math | Code Available | 1 |
| AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation | Apr 25, 2024 | Code Generation, Math | Code Available | 1 |