SOTAVerified

Multiple-choice

Papers

Showing 581–590 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Code | 1 |
| Prompting Implicit Discourse Relation Annotation | | 0 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2 |
| SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | | 0 |
| SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1 |
| Enhancing textual textbook question answering with large language models and retrieval augmented generation | Code | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | | 0 |
| Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation | | 0 |

No leaderboard results yet.