| UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | Feb 1, 2025 | Math | Code Available | 2 |
| Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Jan 29, 2025 | Instruction Following, Math | Code Available | 2 |
| Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling | Jan 20, 2025 | Imitation Learning, Language Modeling | Code Available | 2 |
| URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | Jan 8, 2025 | Math, Mathematical Reasoning | Code Available | 2 |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Dec 20, 2024 | GSM8K, Math | Code Available | 2 |
| ProcessBench: Identifying Process Errors in Mathematical Reasoning | Dec 9, 2024 | GSM8K, Math | Code Available | 2 |
| Preference Optimization for Reasoning with Pseudo Feedback | Nov 25, 2024 | GSM8K, Math | Code Available | 2 |
| LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training | Nov 24, 2024 | Math, Mixture-of-Experts | Code Available | 2 |
| Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus | Nov 19, 2024 | Formal Logic, Logical Reasoning | Code Available | 2 |
| Flaming-hot Initiation with Regular Execution Sampling for Large Language Models | Oct 28, 2024 | Diversity, Math | Code Available | 2 |