SOTAVerified

Multiple-choice

Papers

Showing 176-200 of 1107 papers

Title | Status | Hype
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
Latxa: An Open Language Model and Evaluation Suite for Basque | Code | 1
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
A Few More Examples May Be Worth Billions of Parameters | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Code | 1
Boosting Healthcare LLMs Through Retrieved Context | Code | 1
BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Code | 1
NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1

No leaderboard results yet.