| MMATH: A Multilingual Benchmark for Mathematical Reasoning | May 25, 2025 | Math, Mathematical Reasoning | Code Available | 0 | 5 |
| Analysis of Optimization Algorithms via Sum-of-Squares | Jun 11, 2019 | Math | Code Available | 0 | 5 |
| MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs | Nov 14, 2024 | General Knowledge, Math | Code Available | 0 | 5 |
| Misplaced Trust: Measuring the Interference of Machine Learning in Human Decision-Making | May 22, 2020 | BIG-bench Machine Learning, Decision Making | Code Available | 0 | 5 |
| Automatic Generation of Headlines for Online Math Questions | Nov 27, 2019 | Math | Code Available | 0 | 5 |
| MIRB: Mathematical Information Retrieval Benchmark | May 21, 2025 | Automated Theorem Proving, Information Retrieval | Code Available | 0 | 5 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | Math, Multiple-choice | Code Available | 0 | 5 |
| Mind Scramble: Unveiling Large Language Model Psychology Via Typoglycemia | Oct 2, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Analogical Math Word Problems Solving with Enhanced Problem-Solution Association | Dec 1, 2022 | Math, Question Answering | Code Available | 0 | 5 |
| HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization | May 16, 2025 | Math | Code Available | 0 | 5 |