| Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | May 20, 2025 | HumanEval, MBPP | Code Available | 1 |
| RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale | Jun 24, 2024 | Code Generation, HumanEval | Code Available | 1 |
| Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting | Aug 18, 2024 | HumanEval, Mathematical Reasoning | Unverified | 0 |
| Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | Mar 7, 2025 | Benchmarking, Bug fixing | Unverified | 0 |
| Addressing Data Leakage in HumanEval Using Combinatorial Test Design | Dec 2, 2024 | HumanEval | Unverified | 0 |
| BASS: Batched Attention-optimized Speculative Sampling | Apr 24, 2024 | GPU, HumanEval | Unverified | 0 |
| CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models | Nov 7, 2024 | Code Generation, Decision Making | Unverified | 0 |
| AutoTest: Evolutionary Code Solution Selection with Test Cases | Aug 22, 2024 | Code Generation, HumanEval | Unverified | 0 |
| An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks | May 27, 2025 | Code Generation, Code Summarization | Unverified | 0 |
| Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models | Dec 18, 2024 | HumanEval, Imitation Learning | Unverified | 0 |