| DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation | Aug 23, 2024 | Code Generation, HumanEval | Unverified | 0 |
| CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution | Aug 23, 2024 | Code Generation, HumanEval | Unverified | 0 |
| AutoTest: Evolutionary Code Solution Selection with Test Cases | Aug 22, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs | Aug 18, 2024 | Diversity, GPU | Unverified | 0 |
| Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting | Aug 18, 2024 | HumanEval, Mathematical Reasoning | Unverified | 0 |
| CodeMirage: Hallucinations in Code Generated by Large Language Models | Aug 14, 2024 | Code Generation, Hallucination | Unverified | 0 |
| CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding | Aug 8, 2024 | HumanEval, Retrieval | Unverified | 0 |
| CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Aug 7, 2024 | HumanEval, mbpp | Code Available | 7 |
| ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models | Aug 2, 2024 | Code Generation, HumanEval | Code Available | 1 |
| TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | Jul 30, 2024 | Benchmarking, Code Completion | Unverified | 0 |