| One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks | Oct 14, 2024 | Fairness, GSM8K | Code Available | 0 |
| KV Prediction for Improved Time to First Token | Oct 10, 2024 | Code Completion, CPU | Code Available | 0 |
| Context-Augmented Code Generation Using Programming Knowledge Graphs | Oct 9, 2024 | Code Generation, HumanEval | Unverified | 0 |
| AIME: AI System Optimization via Multiple LLM Evaluators | Oct 4, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Training Language Models on Synthetic Edit Sequences Improves Code Synthesis | Oct 3, 2024 | HumanEval, Synthetic Data Generation | Code Available | 1 |
| RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance | Oct 2, 2024 | Code Generation, HumanEval | Code Available | 0 |
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Oct 2, 2024 | Auto Debugging, Bug fixing | Code Available | 2 |
| AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation | Oct 1, 2024 | Code Generation, HumanEval | Code Available | 0 |
| Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity | Sep 24, 2024 | Code Generation, Contrastive Learning | Unverified | 0 |
| Training Language Models to Self-Correct via Reinforcement Learning | Sep 19, 2024 | HumanEval, Math | Code Available | 2 |
| GRIN: GRadient-INformed MoE | Sep 18, 2024 | HellaSwag, HumanEval | Unverified | 0 |
| RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation | Sep 15, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Measuring the Influence of Incorrect Code on Test Generation | Sep 14, 2024 | HumanEval, Large Language Model | Code Available | 0 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | Sep 13, 2024 | ARC, Code Generation | Unverified | 0 |
| Policy Filtration in RLHF to Fine-Tune LLM for Code Generation | Sep 11, 2024 | Code Generation, HumanEval | Code Available | 1 |
| USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding | Sep 9, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Multi-Programming Language Ensemble for Code Generation in Large Language Model | Sep 6, 2024 | Code Generation, HumanEval | Code Available | 0 |
| How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data | Sep 5, 2024 | Code Generation, Diversity | Code Available | 1 |
| Planning In Natural Language Improves LLM Search For Code Generation | Sep 5, 2024 | Code Generation, Diversity | Code Available | 1 |
| Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining | Sep 3, 2024 | Code Generation, HumanEval | Unverified | 0 |
| DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation | Aug 23, 2024 | Code Generation, HumanEval | Unverified | 0 |
| CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution | Aug 23, 2024 | Code Generation, HumanEval | Unverified | 0 |
| AutoTest: Evolutionary Code Solution Selection with Test Cases | Aug 22, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs | Aug 18, 2024 | Diversity, GPU | Unverified | 0 |
| Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting | Aug 18, 2024 | HumanEval, Mathematical Reasoning | Unverified | 0 |
| CodeMirage: Hallucinations in Code Generated by Large Language Models | Aug 14, 2024 | Code Generation, Hallucination | Unverified | 0 |
| CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding | Aug 8, 2024 | HumanEval, Retrieval | Unverified | 0 |
| CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | Aug 7, 2024 | HumanEval, mbpp | Code Available | 7 |
| ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models | Aug 2, 2024 | Code Generation, HumanEval | Code Available | 1 |
| TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | Jul 30, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| Discrete Flow Matching | Jul 22, 2024 | HumanEval, mbpp | Unverified | 0 |
| Scaling Granite Code Models to 128K Context | Jul 18, 2024 | 2k, 4k | Code Available | 4 |
| Qwen2 Technical Report | Jul 15, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 13 |
| MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants | Jul 12, 2024 | HumanEval | Unverified | 0 |
| InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct | Jul 8, 2024 | Code Generation, Code Summarization | Code Available | 1 |
| Brevity is the soul of wit: Pruning long files for code generation | Jun 29, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Towards Large Language Model Aided Program Refinement | Jun 26, 2024 | HumanEval, Language Modeling | Unverified | 0 |
| RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale | Jun 24, 2024 | Code Generation, HumanEval | Code Available | 1 |
| Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models | Jun 20, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency | Jun 18, 2024 | HumanEval, mbpp | Unverified | 0 |
| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Jun 18, 2024 | All, GSM8K | Code Available | 14 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Jun 16, 2024 | Continual Learning, GSM8K | Code Available | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | Benchmarking, HumanEval | Unverified | 0 |
| PLUM: Improving Code LMs with Execution-Guided On-Policy Preference Learning Driven By Synthetic Test Cases | Jun 11, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Validating LLM-Generated Programs with Metamorphic Prompt Testing | Jun 11, 2024 | HumanEval | Unverified | 0 |
| JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Jun 10, 2024 | Benchmarking, Code Generation | Code Available | 0 |
| How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | Jun 10, 2024 | HumanEval, Program Synthesis | Code Available | 1 |
| Does your data spark joy? Performance gains from domain upsampling at the end of training | Jun 5, 2024 | GSM8K, HumanEval | Unverified | 0 |
| SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning | Jun 3, 2024 | Code Completion, Code Generation | Code Available | 1 |
| Automatic Instruction Evolving for Large Language Models | Jun 2, 2024 | GSM8K, HumanEval | Code Available | 3 |