SOTAVerified

Multiple-choice

Papers

Showing 51–100 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1 |
| Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | | 0 |
| Set-LLM: A Permutation-Invariant LLM | | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | | 0 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4 |
| MR. Judge: Multimodal Reasoner as a Judge | | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0 |
| Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | Code | 0 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0 |
| ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | | 0 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | Code | 1 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2 |
| Ranked Voting based Self-Consistency of Large Language Models | Code | 1 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0 |
| The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0 |
| SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0 |
| Benchmarking AI scientists in omics data-driven biological research | Code | 1 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0 |
| HealthBench: Evaluating Large Language Models Towards Improved Human Health | Code | 7 |
| How well do LLMs reason over tabular data, really? | | 0 |
| Assessing the Chemical Intelligence of Large Language Models | Code | 1 |
| Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | Code | 2 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | Code | 0 |
| ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | Code | 0 |
| Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text | | 0 |
| Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | | 0 |
| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | | 0 |
| LookAlike: Consistent Distractor Generation in Math MCQs | | 0 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | Code | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | | 0 |
| Large Language Models Could Be Rote Learners | | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Code | 0 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1 |
Page 2 of 23
