SOTAVerified

Multiple-choice

Papers

Showing 201–225 of 1107 papers

Title | Status | Hype
Can large language models reason about medical questions? | Code | 1
A Few More Examples May Be Worth Billions of Parameters | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1
Large Language Models Are Not Robust Multiple Choice Selectors | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Code | 1
Evaluating the Knowledge Dependency of Questions | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
FarsTail: A Persian Natural Language Inference Dataset | Code | 1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1
Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1

No leaderboard results yet.