SOTAVerified

Multiple-choice

Papers

Showing 76–100 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| How well do LLMs reason over tabular data, really? | | 0 |
| Assessing the Chemical Intelligence of Large Language Models | Code | 1 |
| Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | Code | 2 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0 |
| ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | Code | 0 |
| Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text | | 0 |
| Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | | 0 |
| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | | 0 |
| LookAlike: Consistent Distractor Generation in Math MCQs | | 0 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | Code | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | | 0 |
| Large Language Models Could Be Rote Learners | | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Code | 0 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1 |

No leaderboard results yet.