SOTAVerified

Multiple-choice

Papers

Showing 151-160 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Code | 2 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Code | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Code | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | | 0 |

No leaderboard results yet.