SOTAVerified

HumanEval

Papers

Showing 51–75 of 264 papers

Title | Status | Hype
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation | Code | 1
Multi-lingual Evaluation of Code Generation Models | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
LeTI: Learning to Generate from Textual Interactions | Code | 1
Predicting Code Coverage without Execution | Code | 1
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Code | 1
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | Code | 1
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation | Code | 1
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
Instruction Tuning With Loss Over Instructions | Code | 1
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Code | 1
ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | Code | 1
CYCLE: Learning to Self-Refine the Code Generation | Code | 1
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1
How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Code | 1
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization | Code | 1
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models | Code | 1
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data | Code | 1
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1
Page 3 of 11

No leaderboard results yet.