SOTAVerified

Multi-task Language Understanding

The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. https://arxiv.org/pdf/2009.03300.pdf

Papers

Showing 26–50 of 57 papers

Title | Status | Hype
UL2: Unifying Language Learning Paradigms | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
Are Human-generated Demonstrations Necessary for In-context Learning? | Code | 1
Gemini: A Family of Highly Capable Multimodal Models | Code | 1
GPT-NeoX-20B: An Open-Source Autoregressive Language Model | Code | 1
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models | Code | 1
Language Models are Unsupervised Multitask Learners | Code | 1
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU | Code | 1
Merging Models with Fisher-Weighted Averaging | Code | 1
RoBERTa: A Robustly Optimized BERT Pretraining Approach | Code | 1
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Code | 1
UnifiedQA: Crossing Format Boundaries With a Single QA System | Code | 1
Claude 3.5 Sonnet Model Card Addendum | | 0
Measuring Hong Kong Massive Multi-Task Language Understanding | | 0
Transcending Scaling Laws with 0.1% Extra Compute | | 0
Model Card and Evaluations for Claude Models | | 0
Orca 2: Teaching Small Language Models How to Reason | | 0
Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning | | 0
Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning | | 0
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0
GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | | 0
Effectiveness of Zero-shot-CoT in Japanese Prompts | | 0
MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | | 0
The Claude 3 Model Family: Opus, Sonnet, Haiku | | 0

No leaderboard results yet.