| Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark | Apr 20, 2025 | MMLU | CodeCode Available | 1 |
| Towards Expert-Level Medical Question Answering with Large Language Models | May 16, 2023 | Medical Question AnsweringMedQA | CodeCode Available | 1 |
| Control LLM: Controlled Evolution for Intelligence Retention in LLM | Jan 19, 2025 | MathMathematical Reasoning | CodeCode Available | 1 |
| Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation | May 17, 2025 | Dataset GenerationGPU | CodeCode Available | 1 |
| Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Oct 8, 2023 | MMLUNatural Language Understanding | CodeCode Available | 1 |
| The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains | Jul 8, 2025 | MathMMLU | CodeCode Available | 1 |
| To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | Sep 18, 2024 | MathMMLU | CodeCode Available | 1 |
| To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Mar 4, 2024 | MedQAMMLU | CodeCode Available | 1 |
| Training Step-Level Reasoning Verifiers with Formal Verification Tools | May 21, 2025 | Formal LogicMath | CodeCode Available | 1 |
| TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Feb 16, 2025 | Machine TranslationMMLU | CodeCode Available | 1 |
| Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In | May 27, 2023 | MMLURetrieval | CodeCode Available | 1 |
| UL2: Unifying Language Learning Paradigms | May 10, 2022 | Arithmetic ReasoningCommon Sense Reasoning | CodeCode Available | 1 |
| ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization | Nov 22, 2023 | GPULanguage Modelling | CodeCode Available | 1 |
| The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models | May 24, 2023 | Language ModellingMath | CodeCode Available | 1 |
| The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale | Jun 25, 2024 | ARCLanguage Modeling | CodeCode Available | 1 |
| SiLVR: A Simple Language-based Video Reasoning Framework | May 30, 2025 | MathMME | CodeCode Available | 1 |
| CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Oct 14, 2024 | MathMathematical Reasoning | CodeCode Available | 1 |
| Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models | Oct 28, 2024 | Few-Shot LearningMMLU | CodeCode Available | 1 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 StitchingArithmetic Reasoning | CodeCode Available | 1 |
| Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression | Apr 10, 2025 | MathMMLU | CodeCode Available | 1 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLURed Teaming | CodeCode Available | 1 |
| Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design | Oct 24, 2024 | Mixture-of-ExpertsMMLU | CodeCode Available | 1 |
| Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | Jun 24, 2024 | MMLUModel Compression | CodeCode Available | 1 |
| From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning | Apr 17, 2023 | MMLUZero-shot Generalization | CodeCode Available | 1 |
| A deeper look at depth pruning of LLMs | Jul 23, 2024 | MMLU | CodeCode Available | 1 |
| OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning | May 28, 2024 | MMLU | CodeCode Available | 1 |
| LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain | Apr 2, 2024 | Argument MiningDecision Making | CodeCode Available | 1 |
| Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Mar 26, 2025 | MMLUMultiple-choice | CodeCode Available | 1 |
| Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments | Dec 16, 2024 | Clinical KnowledgeCollege Medicine | CodeCode Available | 1 |
| Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers | May 21, 2023 | MMLUZero-shot Generalization | CodeCode Available | 1 |
| ArcMMLU: A Library and Information Science Benchmark for Large Language Models | Nov 30, 2023 | MMLU | CodeCode Available | 1 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Jan 20, 2025 | Decision MakingGSM8K | CodeCode Available | 1 |
| Prompt Optimization via Adversarial In-Context Learning | Dec 5, 2023 | Arithmetic ReasoningData-to-Text Generation | CodeCode Available | 1 |
| ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models | Aug 15, 2024 | In-Context LearningMMLU | CodeCode Available | 1 |
| A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Oct 3, 2023 | Arithmetic ReasoningCode Generation | CodeCode Available | 1 |
| OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch | Sep 19, 2023 | BelebeleMMLU | CodeCode Available | 1 |
| MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports | May 16, 2025 | DiagnosticMath | CodeCode Available | 1 |
| Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs | Jul 5, 2024 | General KnowledgeInstruction Following | CodeCode Available | 1 |
| LiveMind: Low-latency Large Language Models with Simultaneous Inference | Jun 20, 2024 | Collaborative InferenceLanguage Modeling | CodeCode Available | 1 |
| LM2: Large Memory Models | Feb 9, 2025 | DecoderMMLU | CodeCode Available | 1 |
| LOGO -- Long cOntext aliGnment via efficient preference Optimization | Oct 24, 2024 | GPULanguage Modeling | CodeCode Available | 1 |
| MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment | Oct 8, 2024 | ARCBelebele | CodeCode Available | 1 |
| HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | May 30, 2025 | ARCGeneral Knowledge | CodeCode Available | 1 |
| Efficient Online Data Mixing For Language Model Pre-Training | Dec 5, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwagLanguage Modeling | CodeCode Available | 1 |
| Instruction Tuning With Loss Over Instructions | May 23, 2024 | HumanEvalMMLU | CodeCode Available | 1 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | BenchmarkingChatbot | CodeCode Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic ReasoningCode Generation | CodeCode Available | 1 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical KnowledgeMedQA | CodeCode Available | 1 |
| Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models | Jun 23, 2024 | Machine TranslationMMLU | CodeCode Available | 1 |