| Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | May 12, 2025 | Code Generation | Code Available | 3 | 5 |
| Automatic Instruction Evolving for Large Language Models | Jun 2, 2024 | GSM8K, HumanEval | Code Available | 3 | 5 |
| SelfCodeAlign: Self-Alignment for Code Generation | Oct 31, 2024 | Code Generation, HumanEval | Code Available | 3 | 5 |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Apr 25, 2024 | GSM8K, HellaSwag | Code Available | 3 | 5 |
| Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | May 2, 2023 | Code Generation, HumanEval | Code Available | 3 | 5 |
| OctoPack: Instruction Tuning Code Large Language Models | Aug 14, 2023 | Code Generation, Code Repair | Code Available | 3 | 5 |
| Evaluating Large Language Models Trained on Code | Jul 7, 2021 | Code Generation, HumanEval | Code Available | 3 | 5 |
| KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Mar 4, 2025 | HumanEval, MBPP | Code Available | 3 | 5 |
| CodeT: Code Generation with Generated Tests | Jul 21, 2022 | Code Generation, HumanEval | Code Available | 2 | 5 |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Feb 8, 2025 | Code Generation, HumanEval | Code Available | 2 | 5 |