| AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models | May 25, 2025 | Math, Mathematical Reasoning | —Unverified | 0 |
| Hawkeye: Efficient Reasoning with Model Collaboration | Apr 1, 2025 | Math, model | —Unverified | 0 |
| HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks | Mar 6, 2025 | Chatbot, Logical Reasoning | —Unverified | 0 |
| hep-th | Jun 27, 2018 | Binary Classification, Math | —Unverified | 0 |
| Evaluating the Design Features of an Intelligent Tutoring System for Advanced Mathematics Learning | Dec 23, 2024 | Math | —Unverified | 0 |
| Evaluating Robustness of Reward Models for Mathematical Reasoning | Oct 2, 2024 | Math, Mathematical Reasoning | —Unverified | 0 |
| Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation | May 29, 2025 | GSM8K, Math | —Unverified | 0 |
| A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions | Dec 12, 2024 | GSM8K, Knowledge Graphs | —Unverified | 0 |
| Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics | Apr 24, 2025 | Code Generation, Math | —Unverified | 0 |
| Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams | Nov 7, 2024 | Math | —Unverified | 0 |