
HumanEval

Papers

Showing 1–50 of 264 papers

Title | Status | Hype
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14
Qwen2 Technical Report | Code | 13
AutoDev: Automated AI-Driven Development | Code | 11
LLM4Decompile: Decompiling Binary Code with Large Language Models | Code | 9
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences | Code | 7
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Code | 7
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | Code | 6
Code Llama: Open Foundation Models for Code | Code | 6
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Code | 5
StarCoder: may the source be with you! | Code | 5
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement | Code | 5
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Code | 5
RLHF Workflow: From Reward Modeling to Online RLHF | Code | 5
Reflexion: Language Agents with Verbal Reinforcement Learning | Code | 4
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Code | 4
Scaling Granite Code Models to 128K Context | Code | 4
Magicoder: Empowering Code Generation with OSS-Instruct | Code | 4
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | Code | 4
Baichuan 2: Open Large-scale Language Models | Code | 4
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | Code | 3
Automatic Instruction Evolving for Large Language Models | Code | 3
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | Code | 3
Evaluating Large Language Models Trained on Code | Code | 3
OctoPack: Instruction Tuning Code Large Language Models | Code | 3
DataDecide: How to Predict Best Pretraining Data with Small Experiments | Code | 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Code | 3
SelfCodeAlign: Self-Alignment for Code Generation | Code | 3
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Code | 3
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Code | 2
Training Language Models to Self-Correct via Reinforcement Learning | Code | 2
CodeT: Code Generation with Generated Tests | Code | 2
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | Code | 2
A Survey on Large Language Models for Code Generation | Code | 2
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Code | 2
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Code | 2
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | Code | 2
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | Code | 2
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Code | 2
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | Code | 2
any4: Learned 4-bit Numeric Representation for LLMs | Code | 2
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
Instruction Tuning With Loss Over Instructions | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
Page 1 of 6
