SOTAVerified

HumanEval

Papers

Showing 1–50 of 264 papers (page 1 of 6)

Title | Status | Hype
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14
Qwen2 Technical Report | Code | 13
AutoDev: Automated AI-Driven Development | Code | 11
LLM4Decompile: Decompiling Binary Code with Large Language Models | Code | 9
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Code | 7
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences | Code | 7
Code Llama: Open Foundation Models for Code | Code | 6
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | Code | 6
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Code | 5
RLHF Workflow: From Reward Modeling to Online RLHF | Code | 5
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Code | 5
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement | Code | 5
StarCoder: may the source be with you! | Code | 5
Reflexion: Language Agents with Verbal Reinforcement Learning | Code | 4
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Code | 4
Magicoder: Empowering Code Generation with OSS-Instruct | Code | 4
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | Code | 4
Baichuan 2: Open Large-scale Language Models | Code | 4
Scaling Granite Code Models to 128K Context | Code | 4
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | Code | 3
Evaluating Large Language Models Trained on Code | Code | 3
Automatic Instruction Evolving for Large Language Models | Code | 3
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | Code | 3
OctoPack: Instruction Tuning Code Large Language Models | Code | 3
DataDecide: How to Predict Best Pretraining Data with Small Experiments | Code | 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Code | 3
SelfCodeAlign: Self-Alignment for Code Generation | Code | 3
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Code | 3
Training Language Models to Self-Correct via Reinforcement Learning | Code | 2
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Code | 2
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | Code | 2
A Survey on Large Language Models for Code Generation | Code | 2
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Code | 2
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | Code | 2
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Code | 2
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Code | 2
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | Code | 2
any4: Learned 4-bit Numeric Representation for LLMs | Code | 2
CodeT: Code Generation with Generated Tests | Code | 2
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Code | 2
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | Code | 2
Instruction Tuning With Loss Over Instructions | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
Fault-Aware Neural Code Rankers | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1

Leaderboard

No leaderboard results yet.