SOTAVerified

Multiple-choice

Papers

Showing 181-190 of 1107 papers

Title | Status | Hype
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
A Few More Examples May Be Worth Billions of Parameters | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Long Horizon Temperature Scaling | Code | 1
Evaluating language models as risk scores | Code | 1
BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1

No leaderboard results yet.