| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Jun 18, 2024 | All, GSM8K | Code Available | 14 | 5 |
| Qwen2 Technical Report | Jul 15, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 13 | 5 |
| SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models | Feb 28, 2025 | MMLU | Code Available | 11 | 5 |
| LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | Mar 26, 2024 | GPU, GSM8K | Code Available | 9 | 5 |
| Yi: Open Foundation Models by 01.AI | Mar 7, 2024 | Attribute, Chatbot | Code Available | 9 | 5 |
| Efficient multi-prompt evaluation of LLMs | May 27, 2024 | MMLU | Code Available | 7 | 5 |
| DataComp-LM: In search of the next generation of training sets for language models | Jun 17, 2024 | Language Modelling, MMLU | Code Available | 7 | 5 |
| Qwen2.5-Omni Technical Report | Mar 26, 2025 | Automatic Speech Recognition (ASR), GSM8K | Code Available | 7 | 5 |
| Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training | May 23, 2024 | GSM8K, Mixture-of-Experts | Code Available | 7 | 5 |
| ART: Automatic multi-step reasoning and tool-use for large language models | Mar 16, 2023 | MMLU | Code Available | 6 | 5 |
| Training Compute-Optimal Large Language Models | Mar 29, 2022 | Anachronisms, Analogical Similarity | Code Available | 6 | 5 |
| Make Your LLM Fully Utilize the Context | Apr 25, 2024 | 4k, Information Retrieval | Code Available | 5 | 5 |
| Baichuan 2: Open Large-scale Language Models | Sep 19, 2023 | Feature Engineering, GSM8K | Code Available | 4 | 5 |
| BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text | Mar 27, 2024 | Articles, Language Modeling | Code Available | 4 | 5 |
| Galactica: A Large Language Model for Science | Nov 16, 2022 | Anachronisms, Bias Detection | Code Available | 4 | 5 |
| Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions | Aug 1, 2024 | Medical Question Answering, MedQA | Code Available | 4 | 5 |
| YourBench: Easy Custom Evaluation Sets for Everyone | Apr 2, 2025 | MMLU | Code Available | 3 | 5 |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Apr 25, 2024 | GSM8K, HellaSwag | Code Available | 3 | 5 |
| Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory | Apr 10, 2025 | Math, MMLU | Code Available | 3 | 5 |
| ChatMusician: Understanding and Generating Music Intrinsically with LLM | Feb 25, 2024 | MMLU, Text Generation | Code Available | 3 | 5 |
| HadaCore: Tensor Core Accelerated Hadamard Transform Kernel | Dec 12, 2024 | GPU, MMLU | Code Available | 3 | 5 |
| General-Reasoner: Advancing LLM Reasoning Across All Domains | May 20, 2025 | All, Math | Code Available | 3 | 5 |
| ReasonIR: Training Retrievers for Reasoning Tasks | Apr 29, 2025 | Information Retrieval, MMLU | Code Available | 3 | 5 |
| Are We Done with MMLU? | Jun 6, 2024 | MMLU, Virology | Code Available | 3 | 5 |
| REPLUG: Retrieval-Augmented Black-Box Language Models | Jan 30, 2023 | Language Modeling, Language Modelling | Code Available | 3 | 5 |
| Compact Language Models via Pruning and Knowledge Distillation | Jul 19, 2024 | Knowledge Distillation, Language Modeling | Code Available | 3 | 5 |
| DataDecide: How to Predict Best Pretraining Data with Small Experiments | Apr 15, 2025 | ARC, HellaSwag | Code Available | 3 | 5 |
| LoLCATs: On Low-Rank Linearizing of Large Language Models | Oct 14, 2024 | MMLU | Code Available | 3 | 5 |
| Scaling Instruction-Finetuned Language Models | Oct 20, 2022 | Coreference Resolution, Cross-Lingual Question Answering | Code Available | 3 | 5 |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Jun 3, 2024 | MMLU, Multi-task Language Understanding | Code Available | 3 | 5 |
| What Matters in Transformers? Not All Attention is Needed | Jun 22, 2024 | All, MMLU | Code Available | 2 | 5 |
| Accurate LoRA-Finetuning Quantization of LLMs via Information Retention | Feb 8, 2024 | MMLU, Quantization | Code Available | 2 | 5 |
| A StrongREJECT for Empty Jailbreaks | Feb 15, 2024 | MMLU | Code Available | 2 | 5 |
| Routoo: Learning to Route to Large Language Models Effectively | Jan 25, 2024 | MMLU, Multi-task Language Understanding | Code Available | 2 | 5 |
| AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Apr 21, 2024 | MMLU, Red Teaming | Code Available | 2 | 5 |
| SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents | Mar 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 | 5 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 | 5 |
| Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Nov 8, 2023 | HumanEval, MMLU | Code Available | 2 | 5 |
| Reinforcing General Reasoning without Verifiers | May 27, 2025 | Math, Mathematical Reasoning | Code Available | 2 | 5 |
| Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | Apr 8, 2025 | Math, Mathematical Reasoning | Code Available | 2 | 5 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLU, Multiple-choice | Code Available | 2 | 5 |
| any4: Learned 4-bit Numeric Representation for LLMs | Jul 7, 2025 | GPU, GSM8K | Code Available | 2 | 5 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | Nov 16, 2023 | MedQA, MMLU | Code Available | 2 | 5 |
| Inheritune: Training Smaller Yet More Attentive Language Models | Apr 12, 2024 | Decoder, Language Modelling | Code Available | 2 | 5 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | Benchmarking, Emotional Intelligence | Code Available | 2 | 5 |
| Aurora:Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning | Dec 22, 2023 | Instruction Following, Mixture-of-Experts | Code Available | 2 | 5 |
| Atlas: Few-shot Learning with Retrieval Augmented Language Models | Aug 5, 2022 | Fact Checking, Few-Shot Learning | Code Available | 2 | 5 |
| Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models | Mar 28, 2025 | MMLU, Quantization | Code Available | 2 | 5 |
| Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In | May 27, 2023 | MMLU, Retrieval | Code Available | 1 | 5 |
| Efficient Online Data Mixing For Language Model Pre-Training | Dec 5, 2023 | Language Modeling, Language Modelling | Code Available | 1 | 5 |