| Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch | Oct 24, 2024 | Math, Mathematical Reasoning | Code Available | 2 |
| A Comparative Study on Reasoning Patterns of OpenAI's o1 Model | Oct 17, 2024 | Math | Code Available | 2 |
| JudgeBench: A Benchmark for Evaluating LLM-based Judges | Oct 16, 2024 | Math | Code Available | 2 |
| Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Oct 11, 2024 | GSM8K, Language Modeling | Code Available | 2 |
| MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Oct 10, 2024 | Math, Mathematical Reasoning | Code Available | 2 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Oct 10, 2024 | GSM8K, Math | Code Available | 2 |
| VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models | Oct 10, 2024 | Math | Code Available | 2 |
| Steering Large Language Models between Code Execution and Textual Reasoning | Oct 4, 2024 | Code Generation, Math | Code Available | 2 |
| VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | Oct 2, 2024 | GSM8K, Math | Code Available | 2 |
| Archon: An Architecture Search Framework for Inference-Time Techniques | Sep 23, 2024 | Hyperparameter Optimization, Instruction Following | Code Available | 2 |