HumanEval

Papers


Showing 51–100 of 264 papers

Title	Date	Tasks	Status	Hype	Score
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions	Jan 17, 2024	Arithmetic Reasoning, Code Generation	Code Available	1	5
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking	Jan 20, 2025	Decision Making, GSM8K	Code Available	1	5
Fault-Aware Neural Code Rankers	Jun 4, 2022	Code Generation, HumanEval	Code Available	1	5
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning	Jun 3, 2024	Code Completion, Code Generation	Code Available	1	5
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models	Jan 12, 2024	Code Generation, HumanEval	Code Available	1	5
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation	May 19, 2024	Code Generation, HumanEval	Code Available	1	5
Predicting Code Coverage without Execution	Jul 25, 2023	HumanEval	Code Available	1	5
EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization	May 24, 2024	Code Generation, HumanEval	Code Available	1	5
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers	Apr 3, 2024	HumanEval	Code Available	1	5
Unsupervised Evaluation of Code LLMs with Round-Trip Correctness	Feb 13, 2024	HumanEval, mbpp	Code Available	1	5
Planning-Driven Programming: A Large Language Model Programming Workflow	Nov 21, 2024	Code Generation, HumanEval	Code Available	1	5
LeTI: Learning to Generate from Textual Interactions	May 17, 2023	Code Generation, Event Argument Extraction	Code Available	1	5
Learning to Generate Unit Tests for Automated Debugging	Feb 3, 2025	HumanEval, Large Language Model	Code Available	1	5
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules	Oct 13, 2023	Code Generation, HumanEval	Code Available	1	5
Rethinking Verification for LLM Code Generation: From Generation to Testing	Jul 9, 2025	Code Generation, HumanEval	Code Available	1	5
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code	May 5, 2025	Code Generation, GSM8K	Code Available	1	5
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration	Oct 3, 2023	Arithmetic Reasoning, Code Generation	Code Available	1	5
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking	May 20, 2025	HumanEval, mbpp	Code Available	1	5
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale	Jun 24, 2024	Code Generation, HumanEval	Code Available	1	5
Instruction Tuning With Loss Over Instructions	May 23, 2024	HumanEval, MMLU	Code Available	1	5
ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models	Aug 2, 2024	Code Generation, HumanEval	Code Available	1	5
Rethinking Repetition Problems of LLMs in Code Generation	May 15, 2025	Code Generation, HumanEval	Code Available	1	5
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization	Feb 26, 2024	Code Generation, HumanEval	Code Available	1	5
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation	May 27, 2024	Code Generation, HumanEval	Code Available	1	5
Is Self-Repair a Silver Bullet for Code Generation?	Jun 16, 2023	Code Generation, HumanEval	Code Available	1	5
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'	Oct 29, 2024	Code Completion, Code Generation	Code Available	1	5
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation	Dec 30, 2024	Code Generation, HumanEval	Code Available	1	5
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks	Oct 16, 2024	Code Generation, HumanEval	Code Available	1	5
How to Select Datapoints for Efficient Human Evaluation of NLG Models?	Jan 30, 2025	HumanEval, Machine Translation	Code Available	1	5
CYCLE: Learning to Self-Refine the Code Generation	Mar 27, 2024	Code Generation, HumanEval	Code Available	1	5
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning	Feb 14, 2024	Code Generation, HumanEval	Code Available	1	5
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct	Jul 8, 2024	Code Generation, Code Summarization	Code Available	1	5
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models	Mar 11, 2024	Code Generation, HumanEval	Code Available	1	5
ReCode: Robustness Evaluation of Code Generation Models	Dec 20, 2022	Code Generation, HumanEval	Code Available	1	5
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair	Dec 25, 2023	HumanEval, parameter-efficient fine-tuning	Code Available	1	5
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation	Aug 3, 2023	Class-level Code Generation, Code Generation	Code Available	1	5
Getting the most out of your tokenizer for pre-training and domain adaptation	Feb 1, 2024	Code Generation, Domain Adaptation	Code Available	1	5
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems	May 17, 2025	Arithmetic Reasoning, Code Generation	Code Available	1	5
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion	Oct 17, 2023	Code Completion, HumanEval	Code Available	1	5
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding	Feb 19, 2024	HumanEval, Language Modeling	Code Available	1	5
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data	Sep 5, 2024	Code Generation, Diversity	Code Available	1	5
Better & Faster Large Language Models via Multi-token Prediction	Apr 30, 2024	HumanEval, mbpp	Code Available	1	5
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models	Feb 24, 2024	HumanEval, Memorization	Code Available	1	5
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models	Feb 23, 2025	Code Generation, HumanEval	Code Available	1	5
ContraCLM: Contrastive Learning For Causal Language Model	Oct 3, 2022	Code Generation, Code Search	Code Available	1	5
Multiple-Choice Questions are Efficient and Robust LLM Evaluators	May 20, 2024	GSM8K, HumanEval	Code Available	1	5
ANPL: Towards Natural Programming with Interactive Decomposition	May 29, 2023	ARC, Code Generation	Code Available	1	5
Multi-lingual Evaluation of Code Generation Models	Oct 26, 2022	Code Completion, Code Generation	Code Available	1	5
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark	Jun 10, 2024	HumanEval, Program Synthesis	Code Available	1	5
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing	Mar 10, 2025	Code Generation, HumanEval	Code Available	1	5

No leaderboard results yet.