HumanEval

Papers

Showing 1–50 of 264 papers

Title	Date	Tasks	Status	Hype
Turning the Tide: Repository-based Code Reflection	Jul 14, 2025	Code Generation, Diversity	Unverified	0
Rethinking Verification for LLM Code Generation: From Generation to Testing	Jul 9, 2025	Code Generation, HumanEval	Code Available	1
any4: Learned 4-bit Numeric Representation for LLMs	Jul 7, 2025	GPU, GSM8K	Code Available	2
SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization	Jun 25, 2025	Code Generation, HumanEval	Unverified	0
Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models	Jun 23, 2025	Code Completion, GSM8K	Unverified	0
AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need	Jun 18, 2025	GSM8K, HumanEval	Code Available	0
LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing	Jun 17, 2025	ARC, CoLA	Unverified	0
Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees	Jun 17, 2025	Code Translation, HumanEval	Unverified	0
Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation	Jun 9, 2025	GSM8K, HumanEval	Unverified	0
SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation	May 30, 2025	Code Generation, HumanEval	Unverified	0
Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach	May 29, 2025	Code Generation, HumanEval	Unverified	0
Actor-Critic based Online Data Mixing For Language Model Pre-Training	May 29, 2025	HumanEval, Language Modeling	Unverified	0
Self-Correcting Code Generation Using Small Language Models	May 29, 2025	Code Generation, HumanEval	Code Available	0
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks	May 27, 2025	Code Generation, Code Summarization	Unverified	0
Evaluating Large Language Models for Code Review	May 26, 2025	HumanEval	Unverified	0
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models	May 25, 2025	GSM8K, HumanEval	Unverified	0
From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?	May 24, 2025	Code Generation, HumanEval	Unverified	0
Prior Prompt Engineering for Reinforcement Fine-Tuning	May 20, 2025	HumanEval, Prompt Engineering	Unverified	0
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking	May 20, 2025	HumanEval, mbpp	Code Available	1
Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings	May 19, 2025	HumanEval, Math	Code Available	0
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems	May 17, 2025	Arithmetic Reasoning, Code Generation	Code Available	1
Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models	May 15, 2025	Code Generation, GSM8K	Unverified	0
Rethinking Repetition Problems of LLMs in Code Generation	May 15, 2025	Code Generation, HumanEval	Code Available	1
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection	May 12, 2025	GSM8K, HumanEval	Unverified	0
Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding	May 12, 2025	Code Generation, Comment Generation	Code Available	0
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks	May 12, 2025	Code Generation	Code Available	3
CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts	May 8, 2025	Code Completion, Code Generation	Unverified	0
Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis	May 5, 2025	Articles, HumanEval	Unverified	0
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code	May 5, 2025	Code Generation, GSM8K	Code Available	1
The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models	May 5, 2025	HumanEval, Program Repair	Unverified	0
ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement	Apr 29, 2025	Code Generation, HumanEval	Unverified	0
DataDecide: How to Predict Best Pretraining Data with Small Experiments	Apr 15, 2025	ARC, HellaSwag	Code Available	3
Type-Constrained Code Generation with Language Models	Apr 12, 2025	Code Generation, HumanEval	Unverified	0
OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs	Apr 5, 2025	Code Generation, HumanEval	Unverified	0
Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency	Apr 4, 2025	Benchmarking, GSM8K	Unverified	0
Can LLMs Enable Verification in Mainstream Programming?	Mar 18, 2025	Code Generation, HumanEval	Unverified	0
Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models	Mar 10, 2025	HumanEval, Program Synthesis	Unverified	0
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing	Mar 10, 2025	Code Generation, HumanEval	Code Available	1
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol	Mar 7, 2025	Benchmarking, Bug fixing	Unverified	0
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?	Mar 7, 2025	Code Generation, HumanEval	Unverified	0
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions	Mar 6, 2025	Benchmarking, HumanEval	Code Available	0
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding	Mar 4, 2025	HumanEval, mbpp	Code Available	3
Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge	Feb 27, 2025	GSM8K, HumanEval	Unverified	0
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval	Feb 26, 2025	Benchmarking, Code Generation	Unverified	0
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation	Feb 26, 2025	Code Generation, HumanEval	Code Available	2
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models	Feb 23, 2025	Code Generation, HumanEval	Code Available	1
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities	Feb 17, 2025	Code Generation, HumanEval	Code Available	1
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance	Feb 17, 2025	Code Generation, HumanEval	Unverified	0
MasRouter: Learning to Route LLMs for Multi-Agent Systems	Feb 16, 2025	HumanEval, mbpp	Code Available	2
CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality	Feb 13, 2025	8k, GPU	Code Available	0

No leaderboard results yet.