SOTAVerified

Multiple-choice

Papers

Showing 551–575 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment |  | 0 |
| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Code | 1 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods |  | 0 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1 |
| Unsupervised multiple choices question answering via universal corpus |  | 0 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Code | 1 |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Code | 1 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Code | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Code | 1 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation |  | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Code | 0 |
| Ranking Large Language Models without Ground Truth |  | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models |  | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge |  | 0 |
| Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities |  | 0 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Code | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models |  | 0 |

No leaderboard results yet.