SOTAVerified

Multiple-choice

Papers

Showing 551–600 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | | 0 |
| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Code | 1 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | | 0 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1 |
| Unsupervised multiple choices question answering via universal corpus | | 0 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Code | 1 |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Code | 1 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Code | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Code | 1 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Code | 0 |
| Ranking Large Language Models without Ground Truth | | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | | 0 |
| Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities | | 0 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Code | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models | | 0 |
| Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Code | 0 |
| Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Code | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Code | 0 |
| DE-COP: Detecting Copyrighted Content in Language Models Training Data | Code | 0 |
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Code | 1 |
| Prompting Implicit Discourse Relation Annotation | | 0 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2 |
| SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | | 0 |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1 |
| Enhancing textual textbook question answering with large language models and retrieval augmented generation | Code | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | | 0 |
| Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation | | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Code | 0 |
| An Information-Theoretic Approach to Analyze NLP Classification Tasks | Code | 0 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1 |
| Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | | 0 |
| Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2 |
| Towards Collective Superintelligence: Amplifying Group IQ using Conversational Swarms | | 0 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1 |
| What Large Language Models Know and What People Think They Know | | 0 |
Page 12 of 23

No leaderboard results yet.