SOTA | Verified

HumanEval

Papers

Showing 51–100 of 264 papers

Title | Status | Hype
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Code | 1
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Code | 1
Learning to Generate Unit Tests for Automated Debugging | Code | 1
How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
Planning-Driven Programming: A Large Language Model Programming Workflow | Code | 1
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Code | 1
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis | Code | 1
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation | Code | 1
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data | Code | 1
Planning In Natural Language Improves LLM Search For Code Generation | Code | 1
ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale | Code | 1
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | Code | 1
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning | Code | 1
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | Code | 1
EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization | Code | 1
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast | Code | 1
Instruction Tuning With Loss Over Instructions | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation | Code | 1
Better & Faster Large Language Models via Multi-token Prediction | Code | 1
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts | Code | 1
The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers | Code | 1
Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization | Code | 1
CYCLE: Learning to Self-Refine the Code Generation | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization | Code | 1
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models | Code | 1
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding | Code | 1
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning | Code | 1
Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | Code | 1
Getting the most out of your tokenizer for pre-training and domain adaptation | Code | 1
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | Code | 1
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair | Code | 1
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | Code | 1
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Code | 1
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation | Code | 1
Predicting Code Coverage without Execution | Code | 1
Is Self-Repair a Silver Bullet for Code Generation? | Code | 1
ANPL: Towards Natural Programming with Interactive Decomposition | Code | 1
LeTI: Learning to Generate from Textual Interactions | Code | 1
ReCode: Robustness Evaluation of Code Generation Models | Code | 1
Multi-lingual Evaluation of Code Generation Models | Code | 1
Page 2 of 6

No leaderboard results yet.