| DebFlow: Automating Agent Creation via Agent Debate | Mar 31, 2025 | Math | —Unverified | 0 |
| ToRL: Scaling Tool-Integrated RL | Mar 30, 2025 | Mathreinforcement-learning | CodeCode Available | 3 |
| Learning to Reason for Long-Form Story Generation | Mar 28, 2025 | FormMath | CodeCode Available | 2 |
| QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? | Mar 28, 2025 | Logical ReasoningMath | CodeCode Available | 1 |
| CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models | Mar 28, 2025 | GPUGSM8K | CodeCode Available | 2 |
| ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models | Mar 27, 2025 | Math | CodeCode Available | 1 |
| Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad | Mar 27, 2025 | MathMathematical Reasoning | —Unverified | 0 |
| Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models | Mar 27, 2025 | Data VisualizationMath | CodeCode Available | 0 |
| Effective Skill Unlearning through Intervention and Abstention | Mar 27, 2025 | General KnowledgeMath | CodeCode Available | 0 |
| Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators | Mar 25, 2025 | Math | —Unverified | 0 |