| HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | May 17, 2025 | MathMathematical Problem-Solving | CodeCode Available | 0 | 5 |
| Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance | Oct 3, 2024 | Mathematical Reasoning | CodeCode Available | 0 | 5 |
| Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models | Jun 18, 2024 | Mathematical Reasoning | CodeCode Available | 0 | 5 |
| GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Jun 1, 2025 | 4kMath | CodeCode Available | 0 | 5 |
| Compositional Processing Emerges in Neural Networks Solving Math Problems | May 19, 2021 | MathMathematical Reasoning | CodeCode Available | 0 | 5 |
| ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention | May 15, 2025 | Code GenerationLanguage Modeling | CodeCode Available | 0 | 5 |
| MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models | Oct 19, 2023 | HallucinationMathematical Reasoning | CodeCode Available | 0 | 5 |
| MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty | Aug 13, 2024 | Mathematical ReasoningQuestion Answering | CodeCode Available | 0 | 5 |
| MARGE: Improving Math Reasoning for LLMs with Guided Exploration | May 18, 2025 | MathMathematical Reasoning | CodeCode Available | 0 | 5 |
| Give me a hint: Can LLMs take a hint to solve math problems? | Oct 8, 2024 | Adversarial RobustnessMath | CodeCode Available | 0 | 5 |