| Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning | Jul 16, 2025 | DiversityMMLU | —Unverified | 0 |
| Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs | Jul 15, 2025 | DiversityMMLU | CodeCode Available | 0 |
| Lizard: An Efficient Linearization Framework for Large Language Models | Jul 11, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Integrating External Tools with Large Language Models to Improve Accuracy | Jul 9, 2025 | Mathematical ReasoningMMLU | —Unverified | 0 |
| Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate | Jul 8, 2025 | Continual LearningMixture-of-Experts | CodeCode Available | 0 |
| Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations | Jul 7, 2025 | AttributeMMLU | CodeCode Available | 0 |
| Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training | Jul 7, 2025 | General KnowledgeMMLU | —Unverified | 0 |
| Multi-lingual Functional Evaluation for Large Language Models | Jun 25, 2025 | BelebeleInstruction Following | —Unverified | 0 |
| Enterprise Large Language Model Evaluation Benchmark | Jun 25, 2025 | Language Model EvaluationLanguage Modeling | —Unverified | 0 |
| Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content | Jun 25, 2025 | ArticlesContinual Pretraining | —Unverified | 0 |
| Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | Jun 18, 2025 | MedQAMMLU | —Unverified | 0 |
| Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models | Jun 12, 2025 | FairnessMMLU | —Unverified | 0 |
| Slimming Down LLMs Without Losing Their Minds | Jun 12, 2025 | Computational EfficiencyGSM8K | —Unverified | 0 |
| MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing | Jun 9, 2025 | GPUMixture-of-Experts | —Unverified | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARCMMLU | —Unverified | 0 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | Jun 5, 2025 | GSM8KMath | —Unverified | 0 |
| GEM: Empowering LLM for both Embedding Generation and Language Understanding | Jun 4, 2025 | DecoderLarge Language Model | —Unverified | 0 |
| Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs | May 31, 2025 | MMLU | CodeCode Available | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | May 30, 2025 | Adversarial Robustnessfeature selection | —Unverified | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | May 30, 2025 | Continual PretrainingFairness | CodeCode Available | 0 |
| Revisiting Uncertainty Estimation and Calibration of Large Language Models | May 29, 2025 | Mixture-of-ExpertsMMLU | —Unverified | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLUMultiple-choice | CodeCode Available | 0 |
| Actor-Critic based Online Data Mixing For Language Model Pre-Training | May 29, 2025 | HumanEvalLanguage Modeling | —Unverified | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLUMultiple-choice | —Unverified | 0 |
| Capability-Based Scaling Laws for LLM Red-Teaming | May 26, 2025 | MMLUPrompt Engineering | CodeCode Available | 0 |
| Interleaved Reasoning for Large Language Models via Reinforcement Learning | May 26, 2025 | Logical ReasoningMath | —Unverified | 0 |
| The Price of Format: Diversity Collapse in LLMs | May 25, 2025 | DiversityGSM8K | CodeCode Available | 0 |
| Efficient Data Selection at Scale via Influence Distillation | May 25, 2025 | GSM8KMMLU | —Unverified | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General KnowledgeMMLU | CodeCode Available | 0 |
| B-score: Detecting biases in large language models using response history | May 24, 2025 | MMLU | —Unverified | 0 |
| LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning | May 24, 2025 | Computational EfficiencyMMLU | CodeCode Available | 0 |
| INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling | May 22, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Cost-aware LLM-based Online Dataset Annotation | May 21, 2025 | MMLU | —Unverified | 0 |
| Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | May 20, 2025 | MMLUReinforcement Learning (RL) | —Unverified | 0 |
| Dual Decomposition of Weights and Singular Value Low Rank Adaptation | May 20, 2025 | GSM8KMMLU | —Unverified | 0 |
| Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst | May 20, 2025 | ARCGSM8K | —Unverified | 0 |
| Void in Language Models | May 20, 2025 | MMLUResponse Generation | CodeCode Available | 0 |
| Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | May 19, 2025 | HumanEvalMath | CodeCode Available | 0 |
| Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | May 16, 2025 | MathMMLU | —Unverified | 0 |
| Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | May 16, 2025 | DiversityMMLU | CodeCode Available | 0 |
| Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning | May 15, 2025 | Continual PretrainingMMLU | —Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | BenchmarkingMMLU | —Unverified | 0 |
| AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection | May 12, 2025 | GSM8KHumanEval | —Unverified | 0 |
| SEM: Reinforcement Learning for Search-Efficient Large Language Models | May 12, 2025 | MMLUreinforcement-learning | —Unverified | 0 |
| A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets | May 9, 2025 | MMLU | —Unverified | 0 |
| Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2 | May 9, 2025 | ARCBelebele | —Unverified | 0 |
| LLMs Outperform Experts on Challenging Biology Benchmarks | May 9, 2025 | MMLUVirology | —Unverified | 0 |
| Measuring Hong Kong Massive Multi-Task Language Understanding | May 4, 2025 | MMLUMulti-task Language Understanding | —Unverified | 0 |
| Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients | May 3, 2025 | GSM8KMMLU | —Unverified | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | MathMMLU | —Unverified | 0 |