| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Jan 9, 2025 | Benchmarking, Mathematical Problem-Solving | Code Available | 1 |
| A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets | May 29, 2023 | Bias Detection, Code Generation | Code Available | 1 |
| SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | Jun 10, 2025 | Knowledge Distillation, Math | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Evaluating Language Models for Mathematics through Interactions | Jun 2, 2023 | Language Modelling, Mathematical Problem-Solving | Code Available | 1 |
| Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Jan 11, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Jun 5, 2025 | Dataset Generation, Mathematical Problem-Solving | Code Available | 1 |
| Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models | Feb 16, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula | Jul 1, 2024 | Mathematical Problem-Solving | Code Available | 1 |
| RaDeR: Reasoning-aware Dense Retrieval Models | May 23, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks | Apr 23, 2024 | Mathematical Problem-Solving, Question Answering | Code Available | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Advancing Reasoning in Large Language Models: Promising Methods and Approaches | Feb 5, 2025 | Mathematical Problem-Solving, Survey | Unverified | 0 |
| Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations | May 16, 2025 | Code Generation, Mathematical Problem-Solving | Unverified | 0 |
| Bayesian artificial brain with ChatGPT | Aug 28, 2023 | Mathematical Problem-Solving | Unverified | 0 |
| MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | Feb 17, 2025 | Code Completion, GSM8K | Unverified | 0 |
| Large Language Models for Mathematical Reasoning: Progresses and Challenges | Jan 31, 2024 | Diversity, Math | Unverified | 0 |
| Kwai-STaR: Transform LLMs into State-Transition Reasoners | Nov 7, 2024 | GSM8K, Mathematical Problem-Solving | Unverified | 0 |
| Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs | Feb 4, 2025 | Formal Logic, Knowledge Graphs | Unverified | 0 |
| JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving | Jun 19, 2023 | In-Context Learning, Language Modeling | Unverified | 0 |
| Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks | Oct 24, 2024 | Logical Reasoning, Mathematical Problem-Solving | Unverified | 0 |
| How Do Large Language Monkeys Get Their Power (Laws)? | Feb 24, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models | Apr 9, 2025 | Instruction Following, Mathematical Problem-Solving | Unverified | 0 |
| Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu | May 22, 2025 | Mathematical Problem-Solving | Unverified | 0 |
| Can LLMs plan paths with extra hints from solvers? | Oct 7, 2024 | Mathematical Problem-Solving, Program Synthesis | Unverified | 0 |