| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | Sep 19, 2024 | General KnowledgeMMLU | —Unverified | 0 | 0 |
| BrainTransformers: SNN-LLM | Oct 3, 2024 | ARCGSM8K | —Unverified | 0 | 0 |
| B-score: Detecting biases in large language models using response history | May 24, 2025 | MMLU | —Unverified | 0 | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Dec 18, 2024 | MMLUReranking | —Unverified | 0 | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLUMultiple-choice | —Unverified | 0 | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Mar 13, 2025 | Language Model EvaluationLanguage Modeling | —Unverified | 0 | 0 |
| Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | May 20, 2025 | MMLUReinforcement Learning (RL) | —Unverified | 0 | 0 |
| Continuous Approximations for Improving Quantization Aware Training of LLMs | Oct 6, 2024 | MMLUModel Compression | —Unverified | 0 | 0 |
| Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | Feb 24, 2025 | 2kARC | —Unverified | 0 | 0 |
| Cost-aware LLM-based Online Dataset Annotation | May 21, 2025 | MMLU | —Unverified | 0 | 0 |