| Actor-Critic based Online Data Mixing For Language Model Pre-Training | May 29, 2025 | HumanEvalLanguage Modeling | —Unverified | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLUMultiple-choice | CodeCode Available | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLUMultiple-choice | —Unverified | 0 |
| Reinforcing General Reasoning without Verifiers | May 27, 2025 | MathMathematical Reasoning | CodeCode Available | 2 |
| Capability-Based Scaling Laws for LLM Red-Teaming | May 26, 2025 | MMLUPrompt Engineering | CodeCode Available | 0 |
| Interleaved Reasoning for Large Language Models via Reinforcement Learning | May 26, 2025 | Logical ReasoningMath | —Unverified | 0 |
| The Price of Format: Diversity Collapse in LLMs | May 25, 2025 | DiversityGSM8K | CodeCode Available | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General KnowledgeMMLU | CodeCode Available | 0 |
| Efficient Data Selection at Scale via Influence Distillation | May 25, 2025 | GSM8KMMLU | —Unverified | 0 |
| LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning | May 24, 2025 | Computational EfficiencyMMLU | CodeCode Available | 0 |
| B-score: Detecting biases in large language models using response history | May 24, 2025 | MMLU | —Unverified | 0 |
| INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling | May 22, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Training Step-Level Reasoning Verifiers with Formal Verification Tools | May 21, 2025 | Formal LogicMath | CodeCode Available | 1 |
| Cost-aware LLM-based Online Dataset Annotation | May 21, 2025 | MMLU | —Unverified | 0 |
| Dual Decomposition of Weights and Singular Value Low Rank Adaptation | May 20, 2025 | GSM8KMMLU | —Unverified | 0 |
| Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst | May 20, 2025 | ARCGSM8K | —Unverified | 0 |
| Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | May 20, 2025 | MMLUReinforcement Learning (RL) | —Unverified | 0 |
| Void in Language Models | May 20, 2025 | MMLUResponse Generation | CodeCode Available | 0 |
| General-Reasoner: Advancing LLM Reasoning Across All Domains | May 20, 2025 | AllMath | CodeCode Available | 3 |
| Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | May 19, 2025 | HumanEvalMath | CodeCode Available | 0 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | BenchmarkingChatbot | CodeCode Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic ReasoningCode Generation | CodeCode Available | 1 |
| Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation | May 17, 2025 | Dataset GenerationGPU | CodeCode Available | 1 |
| Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | May 16, 2025 | MathMMLU | —Unverified | 0 |
| MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports | May 16, 2025 | DiagnosticMath | CodeCode Available | 1 |