SOTAVerified

Multiple-choice

Papers

Showing 151–200 of 1107 papers

Title | Status | Hype
ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Code | 1
NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Code | 1
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Code | 1
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
SportQA: A Benchmark for Sports Understanding in Large Language Models | Code | 1
Uncertainty-Aware Evaluation for Vision-Language Models | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1
The Effect of Sampling Temperature on Problem Solving in Large Language Models | Code | 1
SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1
The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Code | 1
RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Code | 1
HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Marathon: A Race Through the Realm of Long Context with Large Language Models | Code | 1
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1
Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Code | 1
BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Code | 1
LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Code | 1
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
Large Language Models Are Not Robust Multiple Choice Selectors | Code | 1
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1
LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Code | 1
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | Code | 1
Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Conformal Prediction with Large Language Models for Multi-Choice Question Answering | Code | 1
NarrativeXL: A Large-scale Dataset For Long-Term Memory Models | Code | 1
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | Code | 1
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Code | 1
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | Code | 1
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
Long Horizon Temperature Scaling | Code | 1

No leaderboard results yet.