| Implicit Chain of Thought Reasoning via Knowledge Distillation | Nov 2, 2023 | Knowledge DistillationMath | CodeCode Available | 1 |
| Design of Chain-of-Thought in Math Problem Solving | Sep 20, 2023 | DiversityGSM8K | CodeCode Available | 1 |
| How well do Large Language Models perform in Arithmetic tasks? | Mar 16, 2023 | Math | CodeCode Available | 1 |
| Improving the Validity of Automatically Generated Feedback via Reinforcement Learning | Mar 2, 2024 | MathMisconceptions | CodeCode Available | 1 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code CompletionMath | CodeCode Available | 1 |
| DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning | Jun 6, 2024 | Math | CodeCode Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |
| HARP: A challenging human-annotated math reasoning benchmark | Dec 11, 2024 | Math | CodeCode Available | 1 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents | Nov 16, 2023 | Math | CodeCode Available | 1 |