| SSR: Alignment-Aware Modality Connector for Speech Language Models | Sep 30, 2024 | Language Modeling; Language Modelling | —Unverified | 0 | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction; Medical Question Answering | —Unverified | 0 | 0 |
| Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | Oct 18, 2024 | Math; Mathematical Reasoning | —Unverified | 0 | 0 |
| SuperBPE: Space Travel for Language Models | Mar 17, 2025 | Inductive Bias; MMLU | —Unverified | 0 | 0 |
| Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models | Jun 12, 2025 | Fairness; MMLU | —Unverified | 0 | 0 |
| SUTRA: Scalable Multilingual Language Model Architecture | May 7, 2024 | Computational Efficiency; Hallucination | —Unverified | 0 | 0 |
| Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | Feb 23, 2025 | Data Poisoning; Diagnostic | —Unverified | 0 | 0 |
| Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning | Mar 7, 2025 | GPU; Math | —Unverified | 0 | 0 |
| Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models | Oct 9, 2023 | MMLU | —Unverified | 0 | 0 |
| TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise | Oct 29, 2023 | Data Augmentation; Language Modeling | —Unverified | 0 | 0 |
| The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback | Oct 31, 2023 | GSM8K; MMLU | —Unverified | 0 | 0 |
| The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance | Jun 17, 2024 | counterfactual; MMLU | —Unverified | 0 | 0 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | Mar 4, 2024 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | —Unverified | 0 | 0 |
| The Poison of Alignment | Aug 25, 2023 | MMLU | —Unverified | 0 | 0 |
| The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? | Dec 2, 2024 | Language Modeling; Language Modelling | —Unverified | 0 | 0 |
| Uncovering Latent Chain of Thought Vectors in Language Models | Sep 21, 2024 | ARC; GSM8K | —Unverified | 0 | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | Feb 10, 2025 | MMLU; Morphological Analysis | —Unverified | 0 | 0 |
| Towards Multilingual LLM Evaluation for European Languages | Oct 11, 2024 | ARC; GSM8K | —Unverified | 0 | 0 |
| Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | Feb 17, 2025 | MMLU; Natural Questions | —Unverified | 0 | 0 |
| Towards Uncertainty-Aware Language Agent | Jan 25, 2024 | MMLU; StrategyQA | —Unverified | 0 | 0 |
| Transcending Scaling Laws with 0.1% Extra Compute | Oct 20, 2022 | Arithmetic Reasoning; Cross-Lingual Question Answering | —Unverified | 0 | 0 |
| Transferable text data distillation by trajectory matching | Apr 14, 2025 | ARC; Large Language Model | —Unverified | 0 | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | Feb 20, 2025 | Logical Reasoning; MMLU | —Unverified | 0 | 0 |
| Understanding Finetuning for Factual Knowledge Extraction | Jun 20, 2024 | MMLU; Question Answering | —Unverified | 0 | 0 |
| Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size | Mar 6, 2025 | MMLU; Quantization | —Unverified | 0 | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | Jan 1, 2025 | In-Context Learning; Informativeness | —Unverified | 0 | 0 |
| Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach | Mar 13, 2025 | Formal Logic; Mathematical Reasoning | —Unverified | 0 | 0 |
| Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | Dec 17, 2024 | MMLU | —Unverified | 0 | 0 |
| Upcycling Large Language Models into Mixture of Experts | Oct 10, 2024 | Mixture-of-Experts; MMLU | —Unverified | 0 | 0 |
| Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content | Jun 25, 2025 | Articles; Continual Pretraining | —Unverified | 0 | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | Sep 19, 2024 | General Knowledge; MMLU | —Unverified | 0 | 0 |
| BrainTransformers: SNN-LLM | Oct 3, 2024 | ARC; GSM8K | —Unverified | 0 | 0 |
| B-score: Detecting biases in large language models using response history | May 24, 2025 | MMLU | —Unverified | 0 | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Dec 18, 2024 | MMLU; Reranking | —Unverified | 0 | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU; Multiple-choice | —Unverified | 0 | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Mar 13, 2025 | Language Model Evaluation; Language Modeling | —Unverified | 0 | 0 |
| Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | May 20, 2025 | MMLU; Reinforcement Learning (RL) | —Unverified | 0 | 0 |
| Continuous Approximations for Improving Quantization Aware Training of LLMs | Oct 6, 2024 | MMLU; Model Compression | —Unverified | 0 | 0 |
| Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | Feb 24, 2025 | 2k; ARC | —Unverified | 0 | 0 |
| Cost-aware LLM-based Online Dataset Annotation | May 21, 2025 | MMLU | —Unverified | 0 | 0 |
| Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning | Jul 2, 2024 | Active Learning; Language Modelling | —Unverified | 0 | 0 |
| Cost-Saving LLM Cascades with Early Abstention | Feb 13, 2025 | GSM8K; MMLU | —Unverified | 0 | 0 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | Sep 13, 2024 | ARC; Code Generation | —Unverified | 0 | 0 |
| Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | May 16, 2025 | Math; MMLU | —Unverified | 0 | 0 |
| Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting | Jun 17, 2024 | Ethics; MMLU | —Unverified | 0 | 0 |
| AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection | May 12, 2025 | GSM8K; HumanEval | —Unverified | 0 | 0 |
| GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs | Nov 21, 2024 | MMLU; Text Generation | —Unverified | 0 | 0 |
| Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling | Jun 21, 2024 | Clustering; MMLU | —Unverified | 0 | 0 |
| DEM: Distribution Edited Model for Training with Mixed Data Distributions | Jun 21, 2024 | Diversity; Instruction Following | —Unverified | 0 | 0 |
| Detecting Benchmark Contamination Through Watermarking | Feb 24, 2025 | ARC; MMLU | —Unverified | 0 | 0 |