SOTAVerified

Multiple-choice

Papers

Showing 601-625 of 1107 papers

Title | Status | Hype
QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | | 0
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | | 0
On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | | 0
Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | | 0
QOG:Question and Options Generation based on Language Model | | 0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Code | 0
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Code | 0
Grade Score: Quantifying LLM Performance in Option Selection | Code | 0
Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | | 0
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
Bayesian Statistical Modeling with Predictors from LLMs | | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | | 0
OLMES: A Standard for Language Model Evaluations | | 0
BertaQA: How Much Do Language Models Know About Local Culture? | Code | 0
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | | 0
Towards a Personal Health Large Language Model | | 0
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0
Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | | 0
CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Code | 0
LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Code | 0
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Code | 0

No leaderboard results yet.