SOTAVerified

mbpp

Papers

Showing 1-50 of 129 papers

Title | Status | Hype
EvoAgentX: An Automated Framework for Evolving Agentic Workflows | Code | 7
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Code | 7
Code Llama: Open Foundation Models for Code | Code | 6
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Code | 5
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement | Code | 5
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | Code | 4
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | Code | 3
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Code | 3
DataDecide: How to Predict Best Pretraining Data with Small Experiments | Code | 3
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts | Code | 2
any4: Learned 4-bit Numeric Representation for LLMs | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | Code | 2
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Code | 2
CodeT: Code Generation with Generated Tests | Code | 2
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | Code | 2
Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning | Code | 2
A Survey on Large Language Models for Code Generation | Code | 2
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Code | 2
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | Code | 1
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1
Rethinking Repetition Problems of LLMs in Code Generation | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Code | 1
Fault-Aware Neural Code Rankers | Code | 1
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation | Code | 1
Clover: Closed-Loop Verifiable Code Generation | Code | 1
Control LLM: Controlled Evolution for Intelligence Retention in LLM | Code | 1
RLTF: Reinforcement Learning from Unit Test Feedback | Code | 1
LeTI: Learning to Generate from Textual Interactions | Code | 1
Planning In Natural Language Improves LLM Search For Code Generation | Code | 1
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning | Code | 1
Planning-Driven Programming: A Large Language Model Programming Workflow | Code | 1
Program Synthesis with Large Language Models | Code | 1
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1
Improving Code Generation by Training with Natural Language Feedback | Code | 1
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Code | 1
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Code | 1
Better & Faster Large Language Models via Multi-token Prediction | Code | 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | Code | 1
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Code | 1
Learning to Generate Unit Tests for Automated Debugging | Code | 1
CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules | Code | 1
CYCLE: Learning to Self-Refine the Code Generation | Code | 1
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | Code | 1
Getting the most out of your tokenizer for pre-training and domain adaptation | Code | 1

No leaderboard results yet.