SOTAVerified

Multiple-choice

Papers

Showing 441-450 of 1107 papers

Title | Status | Hype
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | - | 0
VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | - | 0
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Code | 2
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
Bayesian Statistical Modeling with Predictors from LLMs | - | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | - | 0
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1

No leaderboard results yet.