SOTAVerified

Multi-task Language Understanding

The benchmark covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. Paper: https://arxiv.org/pdf/2009.03300.pdf
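
For context, below is a minimal sketch of how a model might be scored on an MMLU-style multiple-choice task. It assumes the Hugging Face `cais/mmlu` mirror of the dataset (fields `question`, `choices`, `answer`) and a caller-supplied `model_predict` function, which is hypothetical here: any callable that maps a prompt string to an answer index 0-3. This is not SOTAVerified's or the paper's own evaluation harness.

```python
# Minimal MMLU-style accuracy sketch, assuming the "cais/mmlu" dataset
# and a hypothetical model_predict(prompt) -> int in {0, 1, 2, 3}.
from datasets import load_dataset

def format_question(example):
    # Render one item as "question + lettered choices", a common MMLU prompt style.
    letters = ["A", "B", "C", "D"]
    lines = [example["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(model_predict, subject="high_school_us_history", limit=100):
    # Each MMLU subject is a dataset config; "answer" is the gold choice index.
    ds = load_dataset("cais/mmlu", subject, split="test")
    n = min(limit, len(ds))
    correct = 0
    for example in ds.select(range(n)):
        pred = model_predict(format_question(example))
        correct += int(pred == example["answer"])
    return correct / n
```

Reported MMLU scores are typically the average of such per-subject accuracies across all 57 tasks; few-shot variants prepend worked examples from the dev split to each prompt.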

Papers

Showing 1–50 of 57 papers

Title | Status | Hype
----- | ------ | ----
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Code | 15
Llama 2: Open Foundation and Fine-Tuned Chat Models | Code | 8
LLaMA: Open and Efficient Foundation Language Models | Code | 7
Mistral 7B | Code | 6
GPT-4 Technical Report | Code | 6
GLM-130B: An Open Bilingual Pre-trained Model | Code | 6
Training Compute-Optimal Large Language Models | Code | 6
The Llama 3 Herd of Models | Code | 4
Mixtral of Experts | Code | 4
Galactica: A Large Language Model for Science | Code | 4
REPLUG: Retrieval-Augmented Black-Box Language Models | Code | 3
Evaluating Large Language Models Trained on Code | Code | 3
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Code | 3
Language Models are Few-Shot Learners | Code | 3
Scaling Instruction-Finetuned Language Models | Code | 3
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Code | 2
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Code | 2
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Code | 2
PaLM: Scaling Language Modeling with Pathways | Code | 2
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | Code | 2
Measuring Massive Multitask Language Understanding | Code | 2
Solving Quantitative Reasoning Problems with Language Models | Code | 2
Routoo: Learning to Route to Large Language Models Effectively | Code | 2
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | Code | 2
Atlas: Few-shot Learning with Retrieval Augmented Language Models | Code | 2
UL2: Unifying Language Learning Paradigms | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
Are Human-generated Demonstrations Necessary for In-context Learning? | Code | 1
Gemini: A Family of Highly Capable Multimodal Models | Code | 1
GPT-NeoX-20B: An Open-Source Autoregressive Language Model | Code | 1
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models | Code | 1
Language Models are Unsupervised Multitask Learners | Code | 1
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU | Code | 1
Merging Models with Fisher-Weighted Averaging | Code | 1
RoBERTa: A Robustly Optimized BERT Pretraining Approach | Code | 1
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Code | 1
UnifiedQA: Crossing Format Boundaries With a Single QA System | Code | 1
MERGE: Fast Private Text Generation | Code | 0
Llama 3 Meets MoE: Efficient Upcycling | Code | 0
BloombergGPT: A Large Language Model for Finance | Code | 0
Textbooks Are All You Need II: phi-1.5 technical report | Code | 0
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | Code | 0
PaLM 2 Technical Report | Code | 0
Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning | | 0
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0
GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | | 0
Effectiveness of Zero-shot-CoT in Japanese Prompts | | 0
Model Card and Evaluations for Claude Models | | 0
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 0

Leaderboard

No leaderboard results yet.