| CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation | Feb 9, 2021 | BIG-bench Machine Learning, Clone Detection | Code Available | 1 |
| MACER: A Modular Framework for Accelerated Compilation Error Repair | May 28, 2020 | 4k, Code Repair | Code Available | 1 |
| CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | Jul 14, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | Unverified | 0 |
| CrashFixer: A crash resolution agent for the Linux kernel | Apr 29, 2025 | Code Repair | Unverified | 0 |
| How Accurately Do Large Language Models Understand Code? | Apr 6, 2025 | Code Generation, Code Repair | Unverified | 0 |
| Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Mar 28, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts | Mar 27, 2025 | Code Repair, Feature Engineering | Unverified | 0 |
| SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair | Mar 3, 2025 | Code Completion, Code Repair | Unverified | 0 |
| AuPair: Golden Example Pairs for Code Repair | Feb 12, 2025 | Code Repair, In-Context Learning | Unverified | 0 |