SOTAVerified

Multiple-choice

Papers

Showing 301–350 of 1107 papers

Title | Status | Hype
KoBALT: Korean Benchmark For Advanced Linguistic Tasks | | 0
AutoMCQ -- Automatically Generate Code Comprehension Questions using GenAI | | 0
Set-LLM: A Permutation-Invariant LLM | | 0
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | | 0
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | | 0
WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | | 0
Uncovering Cultural Representation Disparities in Vision-Language Models | | 0
MR. Judge: Multimodal Reasoner as a Judge | | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | Code | 0
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0
Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | | 0
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0
SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | | 0
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
How well do LLMs reason over tabular data, really? | | 0
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0
Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | | 0
ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | Code | 0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | | 0
Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text | | 0
LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | | 0
LookAlike: Consistent Distractor Generation in Math MCQs | | 0
Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0
Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | Code | 0
SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | | 0
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0
FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | | 0
Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | | 0
DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | | 0
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | | 0
Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0
AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | | 0
Large Language Models Could Be Rote Learners | | 0
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Code | 0
InstructionBench: An Instructional Video Understanding Benchmark | | 0
Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0
From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | | 0
VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Code | 0
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | | 0
Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Code | 0
Order Independence With Finetuning | | 0
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | | 0
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | | 0
SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | | 0
Page 7 of 23