| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | Chatbot, Language Modeling | Code Available | 1 |
| Efficient RL Training for Reasoning Models via Length-Aware Optimization | May 18, 2025 | Math | Code Available | 1 |
| Injecting Numerical Reasoning Skills into Language Models | Apr 9, 2020 | Data Augmentation, Decoder | Code Available | 1 |
| Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning | May 12, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| How well do Large Language Models perform in Arithmetic tasks? | Mar 16, 2023 | Math | Code Available | 1 |
| Eliminating Position Bias of Language Models: A Mechanistic Approach | Jul 1, 2024 | Math, object-detection | Code Available | 1 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code Completion, Math | Code Available | 1 |
| Implicit Chain of Thought Reasoning via Knowledge Distillation | Nov 2, 2023 | Knowledge Distillation, Math | Code Available | 1 |
| Improving the Validity of Automatically Generated Feedback via Reinforcement Learning | Mar 2, 2024 | Math, Misconceptions | Code Available | 1 |
| ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models | May 23, 2023 | Math | Code Available | 1 |
| ArMATH: a Dataset for Solving Arabic Math Word Problems | Jun 1, 2022 | Deep Learning, Math | Code Available | 1 |
| Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning | Jun 4, 2023 | Math | Code Available | 1 |
| Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics | Oct 28, 2024 | Arithmetic Reasoning, Math | Code Available | 1 |
| Teaching Language Models to Self-Improve through Interactive Demonstrations | Oct 20, 2023 | Math | Code Available | 1 |
| Entropy-Regularized Process Reward Model | Dec 15, 2024 | GSM8K, Math | Code Available | 1 |
| Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization | Aug 14, 2024 | Informativeness, Instruction Following | Code Available | 1 |
| Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Feb 16, 2025 | Code Generation, Instruction Following | Code Available | 1 |
| Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems | Sep 24, 2020 | Diversity, Math | Code Available | 1 |
| The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? | Mar 14, 2024 | Hallucination, image-classification | Code Available | 1 |
| The Geometry of Concepts: Sparse Autoencoder Feature Structure | Oct 10, 2024 | Math | Code Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Brilla AI: AI Contestant for the National Science and Maths Quiz | Mar 4, 2024 | Math, Question Answering | Code Available | 1 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Dec 28, 2023 | GSM8K, Language Model Evaluation | Code Available | 1 |
| Entropy-Based Adaptive Weighting for Self-Training | Mar 31, 2025 | GSM8K, Math | Code Available | 1 |