| Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Feb 26, 2025 | Code Generation, HumanEval | Code Available | 2 | 5 |
| Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction | Apr 21, 2025 | Math | Code Available | 2 | 5 |
| VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models | Oct 10, 2024 | Math | Code Available | 2 | 5 |
| VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | Oct 2, 2024 | GSM8K, Math | Code Available | 2 | 5 |
| Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula | Aug 8, 2024 | GSM8K, Language Modeling | Code Available | 1 | 5 |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Feb 3, 2025 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning | Jul 11, 2025 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning | Jun 4, 2023 | Math | Code Available | 1 | 5 |
| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | Chatbot, Language Modeling | Code Available | 1 | 5 |
| Building Dataset for Grounding of Formulae — Annotating Coreference Relations Among Math Identifiers | Jun 1, 2022 | Math | Code Available | 1 | 5 |