| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Jun 18, 2024 | All, GSM8K | Code Available | 14 |
| Qwen2 Technical Report | Jul 15, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 13 |
| AutoDev: Automated AI-Driven Development | Mar 13, 2024 | Code Generation, HumanEval | Code Available | 11 |
| LLM4Decompile: Decompiling Binary Code with Large Language Models | Mar 8, 2024 | HumanEval | Code Available | 9 |
| CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Aug 7, 2024 | HumanEval, mbpp | Code Available | 7 |
| CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences | Mar 14, 2024 | HumanEval | Code Available | 7 |
| Code Llama: Open Foundation Models for Code | Aug 24, 2023 | 16k, Code Generation | Code Available | 6 |
| CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | Mar 25, 2022 | Code Generation, HumanEval | Code Available | 6 |
| RLHF Workflow: From Reward Modeling to Online RLHF | May 13, 2024 | Chatbot, HumanEval | Code Available | 5 |
| OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement | Feb 22, 2024 | Code Generation, HumanEval | Code Available | 5 |
| WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Jun 14, 2023 | Code Generation, HumanEval | Code Available | 5 |
| StarCoder: may the source be with you! | May 9, 2023 | 8k, Code Generation | Code Available | 5 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Mar 30, 2023 | Benchmarking, Code Generation | Code Available | 5 |
| Scaling Granite Code Models to 128K Context | Jul 18, 2024 | 2k, 4k | Code Available | 4 |
| Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | Feb 25, 2024 | Code Generation, HumanEval | Code Available | 4 |
| CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Jan 5, 2024 | HumanEval, Prediction | Code Available | 4 |
| Magicoder: Empowering Code Generation with OSS-Instruct | Dec 4, 2023 | Code Generation, HumanEval | Code Available | 4 |
| Baichuan 2: Open Large-scale Language Models | Sep 19, 2023 | Feature Engineering, GSM8K | Code Available | 4 |
| Reflexion: Language Agents with Verbal Reinforcement Learning | Mar 20, 2023 | Decision Making, HumanEval | Code Available | 4 |
| Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | May 12, 2025 | Code Generation | Code Available | 3 |
| DataDecide: How to Predict Best Pretraining Data with Small Experiments | Apr 15, 2025 | ARC, HellaSwag | Code Available | 3 |
| KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Mar 4, 2025 | HumanEval, mbpp | Code Available | 3 |
| SelfCodeAlign: Self-Alignment for Code Generation | Oct 31, 2024 | Code Generation, HumanEval | Code Available | 3 |
| Automatic Instruction Evolving for Large Language Models | Jun 2, 2024 | GSM8K, HumanEval | Code Available | 3 |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Apr 25, 2024 | GSM8K, HellaSwag | Code Available | 3 |
| OctoPack: Instruction Tuning Code Large Language Models | Aug 14, 2023 | Code Generation, Code Repair | Code Available | 3 |
| Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | May 2, 2023 | Code Generation, HumanEval | Code Available | 3 |
| Evaluating Large Language Models Trained on Code | Jul 7, 2021 | Code Generation, HumanEval | Code Available | 3 |
| any4: Learned 4-bit Numeric Representation for LLMs | Jul 7, 2025 | GPU, GSM8K | Code Available | 2 |
| Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Feb 26, 2025 | Code Generation, HumanEval | Code Available | 2 |
| MasRouter: Learning to Route LLMs for Multi-Agent Systems | Feb 16, 2025 | HumanEval, mbpp | Code Available | 2 |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Feb 8, 2025 | Code Generation, HumanEval | Code Available | 2 |
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Oct 2, 2024 | Auto Debugging, Bug fixing | Code Available | 2 |
| Training Language Models to Self-Correct via Reinforcement Learning | Sep 19, 2024 | HumanEval, Math | Code Available | 2 |
| A Survey on Large Language Models for Code Generation | Jun 1, 2024 | Code Generation, HumanEval | Code Available | 2 |
| MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | May 18, 2024 | Code Generation, HumanEval | Code Available | 2 |
| NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | May 7, 2024 | HumanEval, mbpp | Code Available | 2 |
| Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | Mar 28, 2024 | Code Generation, HumanEval | Code Available | 2 |
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Dec 20, 2023 | Code Generation, HumanEval | Code Available | 2 |
| Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Nov 8, 2023 | HumanEval, MMLU | Code Available | 2 |
| Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Oct 6, 2023 | Code Generation, Decision Making | Code Available | 2 |
| Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | Dec 20, 2022 | Automated Theorem Proving, Code Generation | Code Available | 2 |
| MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Aug 17, 2022 | Benchmarking, Code Generation | Code Available | 2 |
| CodeT: Code Generation with Generated Tests | Jul 21, 2022 | Code Generation, HumanEval | Code Available | 2 |
| Rethinking Verification for LLM Code Generation: From Generation to Testing | Jul 9, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | May 20, 2025 | HumanEval, mbpp | Code Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Rethinking Repetition Problems of LLMs in Code Generation | May 15, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | May 5, 2025 | Code Generation, GSM8K | Code Available | 1 |
| RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | Mar 10, 2025 | Code Generation, HumanEval | Code Available | 1 |