SOTAVerified

HumanEval

Papers

Showing 1-50 of 264 papers

Title | Status | Hype
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14
Qwen2 Technical Report | Code | 13
AutoDev: Automated AI-Driven Development | Code | 11
LLM4Decompile: Decompiling Binary Code with Large Language Models | Code | 9
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Code | 7
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences | Code | 7
Code Llama: Open Foundation Models for Code | Code | 6
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | Code | 6
RLHF Workflow: From Reward Modeling to Online RLHF | Code | 5
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement | Code | 5
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Code | 5
StarCoder: may the source be with you! | Code | 5
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Code | 5
Scaling Granite Code Models to 128K Context | Code | 4
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | Code | 4
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Code | 4
Magicoder: Empowering Code Generation with OSS-Instruct | Code | 4
Baichuan 2: Open Large-scale Language Models | Code | 4
Reflexion: Language Agents with Verbal Reinforcement Learning | Code | 4
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | Code | 3
DataDecide: How to Predict Best Pretraining Data with Small Experiments | Code | 3
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Code | 3
SelfCodeAlign: Self-Alignment for Code Generation | Code | 3
Automatic Instruction Evolving for Large Language Models | Code | 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Code | 3
OctoPack: Instruction Tuning Code Large Language Models | Code | 3
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | Code | 3
Evaluating Large Language Models Trained on Code | Code | 3
any4: Learned 4-bit Numeric Representation for LLMs | Code | 2
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Code | 2
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Code | 2
Training Language Models to Self-Correct via Reinforcement Learning | Code | 2
A Survey on Large Language Models for Code Generation | Code | 2
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | Code | 2
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | Code | 2
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | Code | 2
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Code | 2
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Code | 2
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Code | 2
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
CodeT: Code Generation with Generated Tests | Code | 2
Rethinking Verification for LLM Code Generation: From Generation to Testing | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1
Rethinking Repetition Problems of LLMs in Code Generation | Code | 1
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code | Code | 1
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing | Code | 1

No leaderboard results yet.