| How is ChatGPT's behavior changing over time? | Jul 18, 2023 | Code Generation, Language Modelling | Code Available | 4 | 5 |
| InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems | Oct 21, 2024 | Automated Theorem Proving, CPU | Code Available | 4 | 5 |
| Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond | Mar 13, 2025 | Domain Generalization, Math | Code Available | 4 | 5 |
| CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction | Feb 11, 2025 | Code Generation, Math | Code Available | 4 | 5 |
| ReFT: Reasoning with Reinforced Fine-Tuning | Jan 17, 2024 | GSM8K, Math | Code Available | 4 | 5 |
| PAL: Program-aided Language Models | Nov 18, 2022 | Arithmetic Reasoning, GSM8K | Code Available | 3 | 5 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Feb 10, 2025 | Math | Code Available | 3 | 5 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available | 3 | 5 |
| Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | May 1, 2024 | ARC, GSM8K | Code Available | 3 | 5 |
| MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning | May 13, 2024 | Data Augmentation, GSM8K | Code Available | 3 | 5 |