| NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness | Jan 29, 2024 | HumanEval | —Unverified | 0 |
| Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Jan 17, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models | Jan 15, 2024 | HumanEval, Language Modelling | Code Available | 0 |
| OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | Jan 12, 2024 | Code Generation, HumanEval | Code Available | 1 |
| Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs | Jan 11, 2024 | Code Generation, HumanEval | —Unverified | 0 |
| PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs | Jan 8, 2024 | Code Generation, Diversity | —Unverified | 0 |
| CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Jan 5, 2024 | HumanEval, Prediction | Code Available | 4 |
| RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair | Dec 25, 2023 | HumanEval, parameter-efficient fine-tuning | Code Available | 1 |
| Instruction Fusion: Advancing Prompt Evolution through Hybridization | Dec 25, 2023 | Code Generation, HumanEval | Code Available | 0 |
| AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Dec 20, 2023 | Code Generation, HumanEval | Code Available | 2 |