| DataDecide: How to Predict Best Pretraining Data with Small Experiments | Apr 15, 2025 | ARC, HellaSwag | Code Available | 3 |
| REPLUG: Retrieval-Augmented Black-Box Language Models | Jan 30, 2023 | Language Modeling, Language Modelling | Code Available | 3 |
| General-Reasoner: Advancing LLM Reasoning Across All Domains | May 20, 2025 | All, Math | Code Available | 3 |
| LoLCATs: On Low-Rank Linearizing of Large Language Models | Oct 14, 2024 | MMLU | Code Available | 3 |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Jun 3, 2024 | MMLU, Multi-task Language Understanding | Code Available | 3 |
| What Matters in Transformers? Not All Attention is Needed | Jun 22, 2024 | All, MMLU | Code Available | 2 |
| Accurate LoRA-Finetuning Quantization of LLMs via Information Retention | Feb 8, 2024 | MMLU, Quantization | Code Available | 2 |
| Routoo: Learning to Route to Large Language Models Effectively | Jan 25, 2024 | MMLU, Multi-task Language Understanding | Code Available | 2 |
| A StrongREJECT for Empty Jailbreaks | Feb 15, 2024 | MMLU | Code Available | 2 |
| SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents | Mar 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Apr 21, 2024 | MMLU, Red Teaming | Code Available | 2 |
| Reinforcing General Reasoning without Verifiers | May 27, 2025 | Math, Mathematical Reasoning | Code Available | 2 |
| Atlas: Few-shot Learning with Retrieval Augmented Language Models | Aug 5, 2022 | Fact Checking, Few-Shot Learning | Code Available | 2 |
| Rethinking Benchmark and Contamination for Language Models with Rephrased Samples | Nov 8, 2023 | HumanEval, MMLU | Code Available | 2 |
| Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models | Mar 28, 2025 | MMLU, Quantization | Code Available | 2 |
| Inheritune: Training Smaller Yet More Attentive Language Models | Apr 12, 2024 | Decoder, Language Modelling | Code Available | 2 |
| MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning | Nov 16, 2023 | MedQA, MMLU | Code Available | 2 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | Benchmarking, Emotional Intelligence | Code Available | 2 |
| any4: Learned 4-bit Numeric Representation for LLMs | Jul 7, 2025 | GPU, GSM8K | Code Available | 2 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning | Dec 22, 2023 | Instruction Following, Mixture-of-Experts | Code Available | 2 |
| Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | Apr 8, 2025 | Math, Mathematical Reasoning | Code Available | 2 |
| Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In | May 27, 2023 | MMLU, Retrieval | Code Available | 1 |
| LiveMind: Low-latency Large Language Models with Simultaneous Inference | Jun 20, 2024 | Collaborative Inference, Language Modeling | Code Available | 1 |