| ConceptPsy:A Benchmark Suite with Conceptual Comprehensiveness in Psychology | Nov 16, 2023 | MMLUMultiple-choice | —Unverified | 0 |
| Quantifying Variance in Evaluation Benchmarks | Jun 14, 2024 | MMLU | —Unverified | 0 |
| ALLaM: Large Language Models for Arabic and English | Jul 22, 2024 | DecoderLanguage Acquisition | —Unverified | 0 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | Jul 3, 2024 | GSM8KMMLU | —Unverified | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | BenchmarkingHumanEval | —Unverified | 0 |
| Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models | Nov 14, 2024 | Domain GeneralizationIn-Context Learning | —Unverified | 0 |
| Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning | Aug 16, 2024 | counterfactualMMLU | —Unverified | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | Jun 15, 2024 | Mathematical ReasoningMMLU | —Unverified | 0 |
| Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths | Oct 7, 2024 | AttributeGSM8K | —Unverified | 0 |
| Reasoning Robustness of LLMs to Adversarial Typographical Errors | Nov 8, 2024 | GSM8KMMLU | —Unverified | 0 |
| Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training | Jul 7, 2025 | General KnowledgeMMLU | —Unverified | 0 |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | Apr 18, 2024 | GSM8KMMLU | —Unverified | 0 |
| MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs | Sep 3, 2024 | MMLU | CodeCode Available | 0 |
| Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | May 19, 2025 | HumanEvalMath | CodeCode Available | 0 |
| ChatBench: From Static Benchmarks to Human-AI Evaluation | Mar 22, 2025 | MathMMLU | CodeCode Available | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Jun 20, 2024 | GSM8KLanguage Model Evaluation | CodeCode Available | 0 |
| ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling | Feb 21, 2024 | MMLURetrieval | CodeCode Available | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Feb 20, 2025 | GSM8KMath | CodeCode Available | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer SelectionLanguage Modeling | CodeCode Available | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Feb 27, 2024 | 8kLanguage Modeling | CodeCode Available | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Jan 29, 2025 | DiversityMMLU | CodeCode Available | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General KnowledgeMMLU | CodeCode Available | 0 |
| Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | May 16, 2025 | DiversityMMLU | CodeCode Available | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Dec 2, 2024 | MMLUMultiple-choice | CodeCode Available | 0 |
| Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models | Sep 27, 2023 | HumanEvalLanguage Modeling | CodeCode Available | 0 |