SOTAVerified

HumanEval

Papers

Showing 51–100 of 264 papers

Title | Status | Hype
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast | Code | 1
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | Code | 1
Fault-Aware Neural Code Rankers | Code | 1
Rethinking Repetition Problems of LLMs in Code Generation | Code | 1
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | Code | 1
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data | Code | 1
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale | Code | 1
Rethinking Verification for LLM Code Generation: From Generation to Testing | Code | 1
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | Code | 1
Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | Code | 1
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers | Code | 1
Planning In Natural Language Improves LLM Search For Code Generation | Code | 1
Planning-Driven Programming: A Large Language Model Programming Workflow | Code | 1
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation | Code | 1
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | Code | 1
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | Code | 1
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Code | 1
Predicting Code Coverage without Execution | Code | 1
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation | Code | 1
Getting the most out of your tokenizer for pre-training and domain adaptation | Code | 1
ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models | Code | 1
Multi-lingual Evaluation of Code Generation Models | Code | 1
Learning to Generate Unit Tests for Automated Debugging | Code | 1
Is Self-Repair a Silver Bullet for Code Generation? | Code | 1
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1
LeTI: Learning to Generate from Textual Interactions | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning | Code | 1
CYCLE: Learning to Self-Refine the Code Generation | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization | Code | 1
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation | Code | 1
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
Better & Faster Large Language Models via Multi-token Prediction | Code | 1
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models | Code | 1
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | Code | 1
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Code | 1
ContraCLM: Contrastive Learning For Causal Language Model | Code | 1
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | Code | 1
ANPL: Towards Natural Programming with Interactive Decomposition | Code | 1
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair | Code | 1
How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Code | 1
Instruction Tuning With Loss Over Instructions | Code | 1
Page 2 of 6

No leaderboard results yet.