HumanEval

Papers

Showing 126–150 of 264 papers

Title	Date	Tasks	Status	Hype
ARCS: Agentic Retrieval-Augmented Code Synthesis with Iterative Refinement	Apr 29, 2025	Code Generation, HumanEval	Unverified	0
Type-Constrained Code Generation with Language Models	Apr 12, 2025	Code Generation, HumanEval	Unverified	0
OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs	Apr 5, 2025	Code Generation, HumanEval	Unverified	0
Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency	Apr 4, 2025	Benchmarking, GSM8K	Unverified	0
Can LLMs Enable Verification in Mainstream Programming?	Mar 18, 2025	Code Generation, HumanEval	Unverified	0
Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models	Mar 10, 2025	HumanEval, Program Synthesis	Unverified	0
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol	Mar 7, 2025	Benchmarking, Bug fixing	Unverified	0
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?	Mar 7, 2025	Code Generation, HumanEval	Unverified	0
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions	Mar 6, 2025	Benchmarking, HumanEval	Code Available	0
Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge	Feb 27, 2025	GSM8K, HumanEval	Unverified	0
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval	Feb 26, 2025	Benchmarking, Code Generation	Unverified	0
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance	Feb 17, 2025	Code Generation, HumanEval	Unverified	0
CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality	Feb 13, 2025	8k, GPU	Code Available	0
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment	Feb 5, 2025	GSM8K, HumanEval	Unverified	0
Large Language Model Guided Self-Debugging Code Generation	Feb 5, 2025	Code Generation, Computational Efficiency	Unverified	0
ACECODER: Acing Coder RL via Automated Test-Case Synthesis	Feb 3, 2025	HumanEval, mbpp	Unverified	0
Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities	Jan 31, 2025	Code Generation, Hallucination	Unverified	0
CoCoNUT: Structural Code Understanding does not fall out of a tree	Jan 27, 2025	Code Generation, HumanEval	Code Available	0
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks	Jan 20, 2025	Code Generation, HumanEval	Unverified	0
Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs	Jan 14, 2025	Code Generation, HumanEval	Unverified	0
Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks	Jan 11, 2025	Code Generation, HumanEval	Unverified	0
Dafny as Verification-Aware Intermediate Language for Code Generation	Jan 10, 2025	Code Generation, HumanEval	Unverified	0
InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion	Jan 6, 2025	GSM8K, HumanEval	Unverified	0
Dynamic Scaling of Unit Tests for Code Reward Modeling	Jan 2, 2025	Code Generation, HumanEval	Unverified	0
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity	Dec 30, 2024	Benchmarking, Code Generation	Unverified	0

Page 6 of 11

No leaderboard results yet.