
Multiple-choice

Papers

Showing 611–620 of 1107 papers

Title | Status | Hype
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | | 0
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
Bayesian Statistical Modeling with Predictors from LLMs | | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | | 0
OLMES: A Standard for Language Model Evaluations | | 0
BertaQA: How Much Do Language Models Know About Local Culture? | Code | 0
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | | 0
Towards a Personal Health Large Language Model | | 0

No leaderboard results yet.