SOTAVerified

Multiple-choice

Papers

Showing 301-325 of 1107 papers

Title | Status | Hype
AutoMCQ -- Automatically Generate Code Comprehension Questions using GenAI | | 0
KoBALT: Korean Benchmark For Advanced Linguistic Tasks | | 0
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | | 0
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | | 0
Set-LLM: A Permutation-Invariant LLM | | 0
Uncovering Cultural Representation Disparities in Vision-Language Models | | 0
WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | | 0
MR. Judge: Multimodal Reasoner as a Judge | | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | Code | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | | 0
Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | | 0
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0
SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | | 0
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0
How well do LLMs reason over tabular data, really? | | 0
Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | | 0
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0
ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | Code | 0
MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | | 0
Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text | | 0

No leaderboard results yet.