| Galactica: A Large Language Model for Science | Nov 16, 2022 | Anachronisms, Bias Detection | Code Available | 4 | 5 |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Feb 10, 2025 | Hierarchical Reinforcement Learning, Language Modeling | Code Available | 4 | 5 |
| How is ChatGPT's behavior changing over time? | Jul 18, 2023 | Code Generation, Language Modelling | Code Available | 4 | 5 |
| Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Aug 12, 2024 | GSM8K, Math | Code Available | 4 | 5 |
| Energy-Based Transformers are Scalable Learners and Thinkers | Jul 2, 2025 | Denoising, Image Denoising | Code Available | 4 | 5 |
| OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset | Feb 15, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 4 | 5 |
| MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision | May 19, 2025 | Math, Mathematical Reasoning | Code Available | 4 | 5 |
| LLaMA Pro: Progressive LLaMA with Block Expansion | Jan 4, 2024 | Instruction Following, Math | Code Available | 4 | 5 |
| InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning | Feb 9, 2024 | Data Augmentation, GSM8K | Code Available | 4 | 5 |
| AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset | Apr 23, 2025 | Math, Mathematical Reasoning | Code Available | 4 | 5 |
| Dive into Deep Learning | Jun 21, 2021 | Deep Learning, Math | Code Available | 4 | 5 |
| MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | Jul 11, 2024 | Contrastive Learning, Language Modelling | Code Available | 4 | 5 |
| Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Jun 9, 2022 | Common Sense Reasoning, Math | Code Available | 4 | 5 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Oct 2, 2024 | Arithmetic Reasoning, Large Language Model | Code Available | 4 | 5 |
| InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems | Oct 21, 2024 | Automated Theorem Proving, CPU | Code Available | 4 | 5 |
| Thinkless: LLM Learns When to Think | May 19, 2025 | GSM8K, Math | Code Available | 3 | 5 |
| ThoughtSource: A central hub for large language model reasoning data | Jan 27, 2023 | Language Modeling, Language Modelling | Code Available | 3 | 5 |
| Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution | Apr 13, 2025 | GSM8K, Math | Code Available | 3 | 5 |
| TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON | Jul 22, 2024 | Language Modeling, Language Modelling | Code Available | 3 | 5 |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | Sep 29, 2023 | Arithmetic Reasoning, Computational Efficiency | Code Available | 3 | 5 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available | 3 | 5 |
| SymForce: Symbolic Computation and Code Generation for Robotics | Apr 17, 2022 | Code Generation, Math | Code Available | 3 | 5 |
| ToRL: Scaling Tool-Integrated RL | Mar 30, 2025 | Math, reinforcement-learning | Code Available | 3 | 5 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Jun 26, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 3 | 5 |
| BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models | Apr 3, 2024 | GPU, Math | Code Available | 3 | 5 |
| Step-level Value Preference Optimization for Mathematical Reasoning | Jun 16, 2024 | Learning-To-Rank, Math | Code Available | 3 | 5 |
| Large Language Monkeys: Scaling Inference Compute with Repeated Sampling | Jul 31, 2024 | GSM8K, Math | Code Available | 3 | 5 |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Apr 25, 2024 | GSM8K, HellaSwag | Code Available | 3 | 5 |
| Self-Discover: Large Language Models Self-Compose Reasoning Structures | Feb 6, 2024 | Math | Code Available | 3 | 5 |
| Learning to Reason under Off-Policy Guidance | Apr 21, 2025 | Math, Reinforcement Learning (RL) | Code Available | 3 | 5 |
| Llemma: An Open Language Model For Mathematics | Oct 16, 2023 | Arithmetic Reasoning, Automated Theorem Proving | Code Available | 3 | 5 |
| Spurious Rewards: Rethinking Training Signals in RLVR | Jun 12, 2025 | Math, Mathematical Reasoning | Code Available | 3 | 5 |
| Training Verifiers to Solve Math Word Problems | Oct 27, 2021 | GSM8K, Math | Code Available | 3 | 5 |
| Reinforcement Learning for Reasoning in Large Language Models with One Training Example | Apr 29, 2025 | Domain Generalization, Math | Code Available | 3 | 5 |
| How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition | Oct 9, 2023 | Code Generation, Instruction Following | Code Available | 3 | 5 |
| Rho-1: Not All Tokens Are What You Need | Apr 11, 2024 | All, Continual Pretraining | Code Available | 3 | 5 |
| General-Reasoner: Advancing LLM Reasoning Across All Domains | May 20, 2025 | All, Math | Code Available | 3 | 5 |
| Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks | Nov 22, 2022 | Math | Code Available | 3 | 5 |
| Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving | Feb 11, 2025 | Automated Theorem Proving, Large Language Model | Code Available | 3 | 5 |
| PAL: Program-aided Language Models | Nov 18, 2022 | Arithmetic Reasoning, GSM8K | Code Available | 3 | 5 |
| RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation | Jan 9, 2024 | GPU, Math | Code Available | 3 | 5 |
| MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning | May 13, 2024 | Data Augmentation, GSM8K | Code Available | 3 | 5 |
| Noise Contrastive Alignment of Language Models with Explicit Rewards | Feb 8, 2024 | Language Modelling, Math | Code Available | 3 | 5 |
| Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | May 1, 2024 | ARC, GSM8K | Code Available | 3 | 5 |
| MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning | Jun 13, 2024 | Instruction Following, Math | Code Available | 3 | 5 |
| Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory | Apr 10, 2025 | Math, MMLU | Code Available | 3 | 5 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Feb 10, 2025 | Math | Code Available | 3 | 5 |
| Scaling up Masked Diffusion Models on Text | Oct 24, 2024 | GSM8K, Language Modeling | Code Available | 3 | 5 |
| Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models | Feb 12, 2024 | Language Modeling, Language Modelling | Code Available | 3 | 5 |
| MathArena: Evaluating LLMs on Uncontaminated Math Competitions | May 29, 2025 | Math, Mathematical Reasoning | Code Available | 3 | 5 |