| KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Mar 4, 2025 | HumanEval, mbpp | Code Available | 3 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Feb 23, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning | Feb 19, 2025 | mbpp | Unverified | 0 |
| UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance | Feb 17, 2025 | Code Generation, HumanEval | Unverified | 0 |
| MasRouter: Learning to Route LLMs for Multi-Agent Systems | Feb 16, 2025 | HumanEval, mbpp | Code Available | 2 |
| What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces | Feb 10, 2025 | Code Generation, mbpp | Unverified | 0 |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Feb 8, 2025 | Code Generation, HumanEval | Code Available | 2 |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Feb 5, 2025 | GSM8K, HumanEval | Unverified | 0 |
| Learning to Generate Unit Tests for Automated Debugging | Feb 3, 2025 | HumanEval, Large Language Model | Code Available | 1 |