| How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark | May 24, 2025 | Math | Code Available | 0 |
| How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study | May 21, 2025 | Math | Code Available | 0 |
| World Models for Math Story Problems | Jun 7, 2023 | Math | Code Available | 0 |
| One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks | Oct 14, 2024 | Fairness, GSM8K | Code Available | 0 |
| ChatBench: From Static Benchmarks to Human-AI Evaluation | Mar 22, 2025 | Math, MMLU | Code Available | 0 |
| Augmented Math: Authoring AR-Based Explorable Explanations by Augmenting Static Math Textbooks | Jul 30, 2023 | Math, Optical Character Recognition | Code Available | 0 |
| When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Mar 3, 2025 | Math, MMLU | Code Available | 0 |
| Can Large Language Models Replicate ITS Feedback on Open-Ended Math Questions? | May 10, 2024 | Math, text similarity | Code Available | 0 |
| Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy | Dec 8, 2022 | Federated Learning, Math | Code Available | 0 |
| Classifying Math KCs via Task-Adaptive Pre-Trained BERT | May 24, 2021 | Math, Prediction | Code Available | 0 |