SOTAVerified

HumanEval

Papers

Showing 76-100 of 264 papers

| Title | Status | Hype |
| --- | --- | --- |
| Better & Faster Large Language Models via Multi-token Prediction | Code | 1 |
| XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts | Code | 1 |
| The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers | Code | 1 |
| Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | Code | 1 |
| CYCLE: Learning to Self-Refine the Code Generation | Code | 1 |
| InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1 |
| HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization | Code | 1 |
| Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models | Code | 1 |
| Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding | Code | 1 |
| DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning | Code | 1 |
| Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | Code | 1 |
| Getting the most out of your tokenizer for pre-training and domain adaptation | Code | 1 |
| Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1 |
| OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | Code | 1 |
| RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair | Code | 1 |
| CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1 |
| CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | Code | 1 |
| A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Code | 1 |
| ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation | Code | 1 |
| Predicting Code Coverage without Execution | Code | 1 |
| Is Self-Repair a Silver Bullet for Code Generation? | Code | 1 |
| ANPL: Towards Natural Programming with Interactive Decomposition | Code | 1 |
| LeTI: Learning to Generate from Textual Interactions | Code | 1 |
| ReCode: Robustness Evaluation of Code Generation Models | Code | 1 |
| Multi-lingual Evaluation of Code Generation Models | Code | 1 |

No leaderboard results yet.