| CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Feb 23, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Learning to Generate Unit Tests for Automated Debugging | Feb 3, 2025 | HumanEval, Large Language Model | Code Available | 1 |
| How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Jan 30, 2025 | HumanEval, Machine Translation | Code Available | 1 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Jan 20, 2025 | Decision Making, GSM8K | Code Available | 1 |
| HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Dec 30, 2024 | Code Generation, HumanEval | Code Available | 1 |
| Planning-Driven Programming: A Large Language Model Programming Workflow | Nov 21, 2024 | Code Generation, HumanEval | Code Available | 1 |
| PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Nov 18, 2024 | HumanEval, MBPP | Code Available | 1 |
| Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Oct 29, 2024 | Code Completion, Code Generation | Code Available | 1 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Oct 16, 2024 | Code Generation, HumanEval | Code Available | 1 |