SOTAVerified

Multiple-choice

Papers

Showing 151-175 of 1107 papers

Title | Status | Hype
ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Code | 1
NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
SportQA: A Benchmark for Sports Understanding in Large Language Models | Code | 1
Uncertainty-Aware Evaluation for Vision-Language Models | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
The Effect of Sampling Temperature on Problem Solving in Large Language Models | Code | 1
SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Code | 1
HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Code | 1
RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Marathon: A Race Through the Realm of Long Context with Large Language Models | Code | 1
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1
Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Code | 1

No leaderboard results yet.