SOTAVerified

Multiple-choice

Papers

Showing 151–160 of 1107 papers

Title | Status | Hype
ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Code | 1
NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Code | 1
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1
SportQA: A Benchmark for Sports Understanding in Large Language Models | Code | 1
Uncertainty-Aware Evaluation for Vision-Language Models | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1

No leaderboard results yet.