SOTAVerified

Multiple-choice

Papers

Showing 326-350 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | | 0 |
| LookAlike: Consistent Distractor Generation in Math MCQs | | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | Code | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | | 0 |
| Large Language Models Could Be Rote Learners | | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Code | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Code | 0 |
| ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | | 0 |
| Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Code | 0 |
| Order Independence With Finetuning | | 0 |
| Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | | 0 |
| Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | | 0 |
| SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | | 0 |
Page 14 of 45

No leaderboard results yet.