SOTAVerified

HumanEval

Papers

Showing 76–100 of 264 papers

Title | Status | Hype
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1
How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Code | 1
CYCLE: Learning to Self-Refine the Code Generation | Code | 1
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
ReCode: Robustness Evaluation of Code Generation Models | Code | 1
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair | Code | 1
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation | Code | 1
Getting the most out of your tokenizer for pre-training and domain adaptation | Code | 1
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding | Code | 1
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data | Code | 1
Better & Faster Large Language Models via Multi-token Prediction | Code | 1
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models | Code | 1
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Code | 1
ContraCLM: Contrastive Learning For Causal Language Model | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
ANPL: Towards Natural Programming with Interactive Decomposition | Code | 1
Multi-lingual Evaluation of Code Generation Models | Code | 1
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | Code | 1
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | Code | 1
Page 4 of 11

No leaderboard results yet.