SOTAVerified

Multiple-choice

Papers

Showing 521-530 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0 |
| DP-SSL: Towards Robust Semi-supervised Learning with A Few Labeled Samples | | 0 |
| Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | | 0 |
| Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change | | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | | 0 |
| Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models | | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0 |
| Do LLMs Act as Repositories of Causal Knowledge? | | 0 |
| Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales | | 0 |

No leaderboard results yet.