| CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | Jul 14, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | Jul 8, 2025 | Code Repair, Transfer Learning | Code Available | 3 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | Unverified | 0 |
| CrashFixer: A crash resolution agent for the Linux kernel | Apr 29, 2025 | Code Repair | Unverified | 0 |
| How Accurately Do Large Language Models Understand Code? | Apr 6, 2025 | Code Generation, Code Repair | Unverified | 0 |
| Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Mar 28, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts | Mar 27, 2025 | Code Repair, Feature Engineering | Unverified | 0 |
| SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair | Mar 3, 2025 | Code Completion, Code Repair | Unverified | 0 |
| AuPair: Golden Example Pairs for Code Repair | Feb 12, 2025 | Code Repair, In-Context Learning | Unverified | 0 |
| Fortran2CPP: Automating Fortran-to-C++ Translation using LLMs via Multi-Turn Dialogue and Dual-Agent Integration | Dec 27, 2024 | C++ code, Code Repair | Code Available | 1 |