| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | Benchmarking, Code Generation | —Unverified | 0 | 0 |
| Kotlin ML Pack: Technical Report | May 29, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Large Language Model Guided Self-Debugging Code Generation | Feb 5, 2025 | Code Generation, Computational Efficiency | —Unverified | 0 | 0 |
| Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge | Feb 27, 2025 | GSM8K, HumanEval | —Unverified | 0 | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Learning to Reason via Self-Iterative Process Feedback for Small Language Models | Dec 11, 2024 | Domain Generalization, GSM8K | —Unverified | 0 | 0 |
| Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs | Jan 14, 2025 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code | Mar 12, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models | May 25, 2025 | GSM8K, HumanEval | —Unverified | 0 | 0 |
| LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing | Jun 17, 2025 | ARC, CoLA | —Unverified | 0 | 0 |
| LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression | Sep 25, 2023 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation | Apr 17, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants | Jul 12, 2024 | HumanEval | —Unverified | 0 | 0 |
| USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding | Sep 9, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis | May 5, 2025 | Articles, HumanEval | —Unverified | 0 | 0 |
| MojoBench: Language Modeling and Benchmarks for Mojo | Oct 23, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs | Jan 11, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| NExT: Teaching Large Language Models to Reason about Code Execution | Apr 23, 2024 | HumanEval, mbpp | —Unverified | 0 | 0 |
| NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness | Jan 29, 2024 | HumanEval | —Unverified | 0 | 0 |
| On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation | Apr 26, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs | Apr 5, 2025 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback | Jul 27, 2023 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Past as a Guide: Leveraging Retrospective Learning for Python Code Completion | Nov 13, 2023 | Code Completion, HumanEval | —Unverified | 0 | 0 |
| PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation | Dec 17, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? | Oct 26, 2022 | HumanEval, Language Modelling | —Unverified | 0 | 0 |
| Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models | Jun 23, 2025 | Code Completion, GSM8K | —Unverified | 0 | 0 |
| PLUM: Improving Code LMs with Execution-Guided On-Policy Preference Learning Driven By Synthetic Test Cases | Jun 11, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Prior Prompt Engineering for Reinforcement Fine-Tuning | May 20, 2025 | HumanEval, Prompt Engineering | —Unverified | 0 | 0 |
| Qiskit Code Assistant: Training LLMs for generating Quantum Computing Code | May 29, 2024 | HumanEval | —Unverified | 0 | 0 |
| Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models | Jun 20, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | Jan 20, 2025 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | Benchmarking, HumanEval | —Unverified | 0 | 0 |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Feb 5, 2025 | GSM8K, HumanEval | —Unverified | 0 | 0 |
| Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models | May 15, 2025 | Code Generation, GSM8K | —Unverified | 0 | 0 |
| RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation | Sep 15, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization | Jun 25, 2025 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Scattered Forest Search: Smarter Code Space Exploration with LLMs | Oct 22, 2024 | Code Generation, Diversity | —Unverified | 0 | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | —Unverified | 0 | 0 |
| Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity | Sep 24, 2024 | Code Generation, Contrastive Learning | —Unverified | 0 | 0 |
| SelfEvolve: A Code Evolution Framework via Large Language Models | Jun 5, 2023 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Self-Evolving Multi-Agent Collaboration Networks for Software Development | Oct 22, 2024 | HumanEval | —Unverified | 0 | 0 |
| Self-Explained Keywords Empower Large Language Models for Code Generation | Oct 21, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Semantic-guided Search for Efficient Program Repair with Large Language Models | Oct 22, 2024 | GPU, HumanEval | —Unverified | 0 | 0 |
| TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models | Jul 30, 2024 | Benchmarking, Code Completion | —Unverified | 0 | 0 |
| SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths | May 30, 2024 | GSM8K, HumanEval | —Unverified | 0 | 0 |
| Stochastic Code Generation | Apr 14, 2023 | Code Generation, Decoder | —Unverified | 0 | 0 |
| Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency | Apr 4, 2025 | Benchmarking, GSM8K | —Unverified | 0 | 0 |
| SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation | May 30, 2025 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Synthesize, Partition, then Adapt: Eliciting Diverse Samples from Foundation Models | Nov 11, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Test-Driven Development for Code Generation | Feb 21, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |