| OctoPack: Instruction Tuning Code Large Language Models | Aug 14, 2023 | Code GenerationCode Repair | CodeCode Available | 3 |
| Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | May 2, 2023 | Code GenerationHumanEval | CodeCode Available | 3 |
| Evaluating Large Language Models Trained on Code | Jul 7, 2021 | Code GenerationHumanEval | CodeCode Available | 3 |
| any4: Learned 4-bit Numeric Representation for LLMs | Jul 7, 2025 | GPUGSM8K | CodeCode Available | 2 |
| Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Feb 26, 2025 | Code GenerationHumanEval | CodeCode Available | 2 |
| MasRouter: Learning to Route LLMs for Multi-Agent Systems | Feb 16, 2025 | HumanEvalmbpp | CodeCode Available | 2 |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Feb 8, 2025 | Code GenerationHumanEval | CodeCode Available | 2 |
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Oct 2, 2024 | Auto DebuggingBug fixing | CodeCode Available | 2 |
| Training Language Models to Self-Correct via Reinforcement Learning | Sep 19, 2024 | HumanEvalMath | CodeCode Available | 2 |
| A Survey on Large Language Models for Code Generation | Jun 1, 2024 | Code GenerationHumanEval | CodeCode Available | 2 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | May 18, 2024 | Code GenerationHumanEval | CodeCode Available | 2 |
| NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | May 7, 2024 | HumanEvalmbpp | CodeCode Available | 2 |
| Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | Mar 28, 2024 | Code GenerationHumanEval | CodeCode Available | 2 |
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Dec 20, 2023 | Code GenerationHumanEval | CodeCode Available | 2 |
| Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Nov 8, 2023 | HumanEvalMMLU | CodeCode Available | 2 |
| Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Oct 6, 2023 | Code GenerationDecision Making | CodeCode Available | 2 |
| Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | Dec 20, 2022 | Automated Theorem ProvingCode Generation | CodeCode Available | 2 |
| MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Aug 17, 2022 | BenchmarkingCode Generation | CodeCode Available | 2 |
| CodeT: Code Generation with Generated Tests | Jul 21, 2022 | Code GenerationHumanEval | CodeCode Available | 2 |
| Rethinking Verification for LLM Code Generation: From Generation to Testing | Jul 9, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |
| Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | May 20, 2025 | HumanEvalmbpp | CodeCode Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic ReasoningCode Generation | CodeCode Available | 1 |
| Rethinking Repetition Problems of LLMs in Code Generation | May 15, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |
| Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | May 5, 2025 | Code GenerationGSM8K | CodeCode Available | 1 |
| RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | Mar 10, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |