| MIRB: Mathematical Information Retrieval Benchmark | May 21, 2025 | Automated Theorem Proving, Information Retrieval | Code Available | 0 |
| Misplaced Trust: Measuring the Interference of Machine Learning in Human Decision-Making | May 22, 2020 | BIG-bench Machine Learning, Decision Making | Code Available | 0 |
| Distinguishing affixoid formations from compounds | Aug 1, 2018 | Management, Math | Code Available | 0 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | Math, Multiple-choice | Code Available | 0 |
| Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving | Oct 15, 2019 | Math, Question Answering | Code Available | 0 |
| AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails | Feb 14, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| MMATH: A Multilingual Benchmark for Mathematical Reasoning | May 25, 2025 | Math, Mathematical Reasoning | Code Available | 0 |
| Learning a Continue-Thinking Token for Enhanced Test-Time Scaling | Jun 12, 2025 | GSM8K, Math | Code Available | 0 |
| Algebra Error Classification with Large Language Models | May 8, 2023 | Classification, Math | Code Available | 0 |
| MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs | Nov 14, 2024 | General Knowledge, Math | Code Available | 0 |