| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic Reasoning; Code Generation | Code Available | 1 |
| From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning | Apr 17, 2023 | MMLU; Zero-shot Generalization | Code Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwag; Language Modeling | Code Available | 1 |
| HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | May 30, 2025 | ARC; General Knowledge | Code Available | 1 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | Benchmarking; Chatbot | Code Available | 1 |
| Instruction Tuning With Loss Over Instructions | May 23, 2024 | HumanEval; MMLU | Code Available | 1 |
| Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models | Jun 23, 2024 | Machine Translation; MMLU | Code Available | 1 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical Knowledge; MedQA | Code Available | 1 |
| MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment | Oct 8, 2024 | ARC; Belebele | Code Available | 1 |