| any4: Learned 4-bit Numeric Representation for LLMs | Jul 7, 2025 | GPUGSM8K | CodeCode Available | 2 |
| EvoAgentX: An Automated Framework for Evolving Agentic Workflows | Jul 4, 2025 | Code GenerationMath | CodeCode Available | 7 |
| SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization | Jun 25, 2025 | Code GenerationHumanEval | —Unverified | 0 |
| Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models | Jun 23, 2025 | Code CompletionGSM8K | —Unverified | 0 |
| Enhancing Reasoning Capabilities of Small Language Models with Blueprints and Prompt Template Search | Jun 10, 2025 | GSM8KMath | —Unverified | 0 |
| Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation | Jun 9, 2025 | GSM8KHumanEval | —Unverified | 0 |
| Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach | May 29, 2025 | Code GenerationHumanEval | —Unverified | 0 |
| Self-Correcting Code Generation Using Small Language Models | May 29, 2025 | Code GenerationHumanEval | CodeCode Available | 0 |
| LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models | May 25, 2025 | GSM8KHumanEval | —Unverified | 0 |
| Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | May 20, 2025 | HumanEvalmbpp | CodeCode Available | 1 |
| Rethinking Repetition Problems of LLMs in Code Generation | May 15, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |
| Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models | May 15, 2025 | Code GenerationGSM8K | —Unverified | 0 |
| Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks | May 12, 2025 | Code Generation | CodeCode Available | 3 |
| CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts | May 8, 2025 | Code CompletionCode Generation | —Unverified | 0 |
| DataDecide: How to Predict Best Pretraining Data with Small Experiments | Apr 15, 2025 | ARCHellaSwag | CodeCode Available | 3 |
| Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning | Apr 14, 2025 | Mathematical Reasoningmbpp | CodeCode Available | 2 |
| Type-Constrained Code Generation with Language Models | Apr 12, 2025 | Code GenerationHumanEval | —Unverified | 0 |
| OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs | Apr 5, 2025 | Code GenerationHumanEval | —Unverified | 0 |
| DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation | Mar 13, 2025 | Code Generationmbpp | —Unverified | 0 |
| Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs? | Mar 7, 2025 | Code GenerationHumanEval | —Unverified | 0 |
| KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Mar 4, 2025 | HumanEvalmbpp | CodeCode Available | 3 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | BenchmarkingCode Generation | —Unverified | 0 |
| CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models | Feb 23, 2025 | Code GenerationHumanEval | CodeCode Available | 1 |
| Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning | Feb 19, 2025 | mbpp | —Unverified | 0 |
| UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance | Feb 17, 2025 | Code GenerationHumanEval | —Unverified | 0 |