SOTAVerified

Multiple-choice

Papers

Showing 61-70 of 1107 papers

Title | Status | Hype
GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2
SafetyBench: Evaluating the Safety of Large Language Models | Code | 2
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization | Code | 2
Perception Test: A Diagnostic Benchmark for Multimodal Models | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Code | 2

No leaderboard results yet.