SOTAVerified

HumanEval

Papers

Showing 51–100 of 264 papers

| Title | Status | Hype |
| --- | --- | --- |
| CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | Code | 2 |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | | 0 |
| Large Language Model Guided Self-Debugging Code Generation | | 0 |
| Learning to Generate Unit Tests for Automated Debugging | Code | 1 |
| ACECODER: Acing Coder RL via Automated Test-Case Synthesis | | 0 |
| Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities | | 0 |
| How to Select Datapoints for Efficient Human Evaluation of NLG Models? | Code | 1 |
| CoCoNUT: Structural Code Understanding does not fall out of a tree | Code | 0 |
| QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | | 0 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1 |
| Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs | | 0 |
| Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks | | 0 |
| Dafny as Verification-Aware Intermediate Language for Code Generation | | 0 |
| InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion | | 0 |
| Dynamic Scaling of Unit Tests for Code Reward Modeling | | 0 |
| Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement | | 0 |
| HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | Code | 1 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0 |
| Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference | | 0 |
| Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models | | 0 |
| PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation | | 0 |
| Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree | | 0 |
| Learning to Reason via Self-Iterative Process Feedback for Small Language Models | | 0 |
| AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement | | 0 |
| Does Few-Shot Learning Help LLM Performance in Code Synthesis? | | 0 |
| Addressing Data Leakage in HumanEval Using Combinatorial Test Design | | 0 |
| Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers | Code | 0 |
| A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks | | 0 |
| Planning-Driven Programming: A Large Language Model Programming Workflow | Code | 1 |
| DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs | | 0 |
| PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Code | 1 |
| VALTEST: Automated Validation of Language Model Generated Test Cases | | 0 |
| Synthesize, Partition, then Adapt: Eliciting Diverse Samples from Foundation Models | | 0 |
| CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models | | 0 |
| InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation | Code | 0 |
| SelfCodeAlign: Self-Alignment for Code Generation | Code | 3 |
| Demo-Craft: Using In-Context Learning to Improve Code Generation in Large Language Models | | 0 |
| Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' | Code | 1 |
| FALCON: Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization system | Code | 0 |
| Aligning CodeLLMs with Direct Preference Optimization | | 0 |
| Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | | 0 |
| MojoBench: Language Modeling and Benchmarks for Mojo | | 0 |
| Scattered Forest Search: Smarter Code Space Exploration with LLMs | | 0 |
| Self-Evolving Multi-Agent Collaboration Networks for Software Development | | 0 |
| Semantic-guided Search for Efficient Program Repair with Large Language Models | | 0 |
| Self-Explained Keywords Empower Large Language Models for Code Generation | | 0 |
| mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation | Code | 0 |
| CELI: Controller-Embedded Language Model Interactions | | 0 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | | 0 |

No leaderboard results yet.