SOTAVerified

Multiple-choice

Papers

Showing 251–260 of 1107 papers

Title | Status | Hype
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
WIQA: A dataset for "What if..." reasoning over procedural text | Code | 1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
General-Purpose Question-Answering with Macaw | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | | 0
Contextual Response Interpretation for Automated Structured Interviews: A Case Study in Market Research | | 0
Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | | 0

No leaderboard results yet.