SOTAVerified

Multiple-choice

Papers

Showing 91-100 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| Unifying Specialized Visual Encoders for Video Language Models | Code | 1 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Code | 1 |
| SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Code | 1 |

No leaderboard results yet.