| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model Evaluation; Language Modeling | —Unverified | 0 |
| Slimming Down LLMs Without Losing Their Minds | Jun 12, 2025 | Computational Efficiency; GSM8K | —Unverified | 0 |
| YAYI 2: Multilingual Open-Source Large Language Models | Dec 22, 2023 | MMLU | —Unverified | 0 |
| Spanish and LLM Benchmarks: is MMLU Lost in Translation? | May 28, 2024 | MMLU; Translation | —Unverified | 0 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | Sep 30, 2024 | Language Modeling; Language Modelling | —Unverified | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction; Medical Question Answering | —Unverified | 0 |
| Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | Oct 18, 2024 | Math; Mathematical Reasoning | —Unverified | 0 |
| SuperBPE: Space Travel for Language Models | Mar 17, 2025 | Inductive Bias; MMLU | —Unverified | 0 |
| Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models | Jun 12, 2025 | Fairness; MMLU | —Unverified | 0 |
| SUTRA: Scalable Multilingual Language Model Architecture | May 7, 2024 | Computational Efficiency; Hallucination | —Unverified | 0 |
| Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | Feb 23, 2025 | Data Poisoning; Diagnostic | —Unverified | 0 |
| Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning | Mar 7, 2025 | GPU; Math | —Unverified | 0 |
| Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models | Oct 9, 2023 | MMLU | —Unverified | 0 |
| TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise | Oct 29, 2023 | Data Augmentation; Language Modeling | —Unverified | 0 |
| The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback | Oct 31, 2023 | GSM8K; MMLU | —Unverified | 0 |
| The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance | Jun 17, 2024 | counterfactual; MMLU | —Unverified | 0 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | Mar 4, 2024 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | —Unverified | 0 |
| The Poison of Alignment | Aug 25, 2023 | MMLU | —Unverified | 0 |
| The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? | Dec 2, 2024 | Language Modeling; Language Modelling | —Unverified | 0 |
| Uncovering Latent Chain of Thought Vectors in Language Models | Sep 21, 2024 | ARC; GSM8K | —Unverified | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | Feb 10, 2025 | MMLU; Morphological Analysis | —Unverified | 0 |
| Towards Multilingual LLM Evaluation for European Languages | Oct 11, 2024 | ARC; GSM8K | —Unverified | 0 |
| Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | Feb 17, 2025 | MMLU; Natural Questions | —Unverified | 0 |
| Towards Uncertainty-Aware Language Agent | Jan 25, 2024 | MMLU; StrategyQA | —Unverified | 0 |
| Transcending Scaling Laws with 0.1% Extra Compute | Oct 20, 2022 | Arithmetic Reasoning; Cross-Lingual Question Answering | —Unverified | 0 |
| Transferable text data distillation by trajectory matching | Apr 14, 2025 | ARC; Large Language Model | —Unverified | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | Feb 20, 2025 | Logical Reasoning; MMLU | —Unverified | 0 |
| Understanding Finetuning for Factual Knowledge Extraction | Jun 20, 2024 | MMLU; Question Answering | —Unverified | 0 |
| Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size | Mar 6, 2025 | MMLU; Quantization | —Unverified | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | Jan 1, 2025 | In-Context Learning; Informativeness | —Unverified | 0 |
| Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach | Mar 13, 2025 | Formal Logic; Mathematical Reasoning | —Unverified | 0 |
| Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Dec 17, 2024 | MMLU | —Unverified | 0 |
| Upcycling Large Language Models into Mixture of Experts | Oct 10, 2024 | Mixture-of-Experts; MMLU | —Unverified | 0 |
| Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content | Jun 25, 2025 | Articles; Continual Pretraining | —Unverified | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | Sep 19, 2024 | General Knowledge; MMLU | —Unverified | 0 |
| BrainTransformers: SNN-LLM | Oct 3, 2024 | ARC; GSM8K | —Unverified | 0 |
| B-score: Detecting biases in large language models using response history | May 24, 2025 | MMLU | —Unverified | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Dec 18, 2024 | MMLU; Reranking | —Unverified | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU; Multiple-choice | —Unverified | 0 |
| Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients | May 3, 2025 | GSM8K; MMLU | —Unverified | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | Oct 15, 2024 | GSM8K; Math | —Unverified | 0 |
| Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning | May 15, 2025 | Continual Pretraining; MMLU | —Unverified | 0 |
| MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures | Jun 3, 2024 | Chatbot; MMLU | —Unverified | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | Dec 19, 2024 | MMLU; Quantization | —Unverified | 0 |
| Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | Nov 27, 2024 | GSM8K; Language Modeling | —Unverified | 0 |
| Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents | Nov 5, 2024 | MMLU | —Unverified | 0 |
| An Assessment of Model-On-Model Deception | May 10, 2024 | MMLU; model | —Unverified | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Mar 13, 2025 | Language Model Evaluation; Language Modeling | —Unverified | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | May 30, 2025 | Adversarial Robustness; feature selection | —Unverified | 0 |
| MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing | Jun 9, 2025 | GPU; Mixture-of-Experts | —Unverified | 0 |