SOTAVerified

Multiple-choice

Papers

Showing 576–600 of 1107 papers

Title | Status | Hype
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Code | 0
Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Code | 0
KMMLU: Measuring Massive Multitask Language Understanding in Korean | - | 0
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Code | 0
DE-COP: Detecting Copyrighted Content in Language Models Training Data | Code | 0
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2
The Effect of Sampling Temperature on Problem Solving in Large Language Models | Code | 1
Prompting Implicit Discourse Relation Annotation | - | 0
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2
SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | - | 0
Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | - | 0
SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1
Enhancing textual textbook question answering with large language models and retrieval augmented generation | Code | 0
LLMs May Perform MCQA by Selecting the Least Incorrect Option | - | 0
Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation | - | 0
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Code | 0
An Information-Theoretic Approach to Analyze NLP Classification Tasks | Code | 0
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | - | 0
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2
Towards Collective Superintelligence: Amplifying Group IQ using Conversational Swarms | - | 0
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1
What Large Language Models Know and What People Think They Know | - | 0

No leaderboard results yet.