| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Jan 22, 2025 | Mathematical Reasoning; Multi-task Language Understanding | Code Available | 15 | 5 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Jul 18, 2023 | Arithmetic Reasoning | Code Available | 8 | 5 |
| LLaMA: Open and Efficient Foundation Language Models | Feb 27, 2023 | Arithmetic Reasoning; Code Generation | Code Available | 7 | 5 |
| Mistral 7B | Oct 10, 2023 | answerability prediction; Arithmetic Reasoning | Code Available | 6 | 5 |
| GPT-4 Technical Report | Mar 15, 2023 | answerability prediction; Arithmetic Reasoning | Code Available | 6 | 5 |
| GLM-130B: An Open Bilingual Pre-trained Model | Oct 5, 2022 | Language Modeling; Language Modelling | Code Available | 6 | 5 |
| Training Compute-Optimal Large Language Models | Mar 29, 2022 | Anachronisms; Analogical Similarity | Code Available | 6 | 5 |
| The Llama 3 Herd of Models | Jul 31, 2024 | answerability prediction; Language Modeling | Code Available | 4 | 5 |
| Mixtral of Experts | Jan 8, 2024 | Code Generation; Common Sense Reasoning | Code Available | 4 | 5 |
| Galactica: A Large Language Model for Science | Nov 16, 2022 | Anachronisms; Bias Detection | Code Available | 4 | 5 |
| REPLUG: Retrieval-Augmented Black-Box Language Models | Jan 30, 2023 | Language Modeling; Language Modelling | Code Available | 3 | 5 |
| Evaluating Large Language Models Trained on Code | Jul 7, 2021 | Code Generation; HumanEval | Code Available | 3 | 5 |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Jun 3, 2024 | MMLU; Multi-task Language Understanding | Code Available | 3 | 5 |
| Language Models are Few-Shot Learners | May 28, 2020 | answerability prediction; Articles | Code Available | 3 | 5 |
| Scaling Instruction-Finetuned Language Models | Oct 20, 2022 | Coreference Resolution; Cross-Lingual Question Answering | Code Available | 3 | 5 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLU; Multiple-choice | Code Available | 2 | 5 |
| Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Jun 18, 2024 | Arithmetic Reasoning; Language Modeling | Code Available | 2 | 5 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Dec 8, 2021 | Abstract Algebra; Anachronisms | Code Available | 2 | 5 |
| PaLM: Scaling Language Modeling with Pathways | Apr 5, 2022 | Auto Debugging; Code Generation | Code Available | 2 | 5 |
| Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | Jan 5, 2024 | Arithmetic Reasoning; Code Generation | Code Available | 2 | 5 |
| Measuring Massive Multitask Language Understanding | Sep 7, 2020 | Elementary Mathematics; Multi-task Language Understanding | Code Available | 2 | 5 |
| Solving Quantitative Reasoning Problems with Language Models | Jun 29, 2022 | Arithmetic Reasoning; Language Modeling | Code Available | 2 | 5 |
| Routoo: Learning to Route to Large Language Models Effectively | Jan 25, 2024 | MMLU; Multi-task Language Understanding | Code Available | 2 | 5 |
| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | Sep 26, 2019 | Common Sense Reasoning; GPU | Code Available | 2 | 5 |
| Atlas: Few-shot Learning with Retrieval Augmented Language Models | Aug 5, 2022 | Fact Checking; Few-Shot Learning | Code Available | 2 | 5 |
| UL2: Unifying Language Learning Paradigms | May 10, 2022 | Arithmetic Reasoning; Common Sense Reasoning | Code Available | 1 | 5 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU; Language Model Evaluation | Code Available | 1 | 5 |
| Are Human-generated Demonstrations Necessary for In-context Learning? | Sep 26, 2023 | Arithmetic Reasoning; Code Generation | Code Available | 1 | 5 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 | 5 |
| GPT-NeoX-20B: An Open-Source Autoregressive Language Model | Apr 14, 2022 | Language Modeling; Language Modelling | Code Available | 1 | 5 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning; Code Generation | Code Available | 1 | 5 |
| MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models | Oct 30, 2023 | Language Modeling; Language Modelling | Code Available | 1 | 5 |
| Language Models are Unsupervised Multitask Learners | Feb 14, 2019 | Common Sense Reasoning; Coreference Resolution | Code Available | 1 | 5 |
| Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU | Oct 7, 2023 | Multi-task Language Understanding; World Knowledge | Code Available | 1 | 5 |
| Merging Models with Fisher-Weighted Averaging | Nov 18, 2021 | Domain Adaptation; Multi-task Language Understanding | Code Available | 1 | 5 |
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | Jul 26, 2019 | Common Sense Reasoning; Document Image Classification | Code Available | 1 | 5 |
| TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Feb 16, 2025 | Machine Translation; MMLU | Code Available | 1 | 5 |
| UnifiedQA: Crossing Format Boundaries With a Single QA System | May 2, 2020 | Common Sense Reasoning; Language Modeling | Code Available | 1 | 5 |
| MERGE: Fast Private Text Generation | May 25, 2023 | Code Completion; Multi-task Language Understanding | Code Available | 0 | 5 |
| Llama 3 Meets MoE: Efficient Upcycling | Dec 13, 2024 | Mixture-of-Experts; MMLU | Code Available | 0 | 5 |
| BloombergGPT: A Large Language Model for Finance | Mar 30, 2023 | Causal Judgment; Common Sense Reasoning | Code Available | 0 | 5 |
| Textbooks Are All You Need II: phi-1.5 technical report | Sep 11, 2023 | All; Code Generation | Code Available | 0 | 5 |
| Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | Mar 12, 2024 | Arithmetic Reasoning; Code Generation | Code Available | 0 | 5 |
| PaLM 2 Technical Report | May 17, 2023 | Code Generation; Common Sense Reasoning | Code Available | 0 | 5 |
| Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning | Jun 25, 2023 | counterfactual; Math | Unverified | 0 | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | Jan 27, 2025 | Benchmarking; Diversity | Unverified | 0 | 0 |
| GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | Oct 3, 2024 | Active Learning; Language Modeling | Unverified | 0 | 0 |
| Effectiveness of Zero-shot-CoT in Japanese Prompts | Mar 9, 2025 | Abstract Algebra; College Mathematics | Unverified | 0 | 0 |
| Model Card and Evaluations for Claude Models | Jul 11, 2023 | Arithmetic Reasoning; Bug fixing | Unverified | 0 | 0 |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | Mar 4, 2024 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Unverified | 0 | 0 |