SOTAVerified

Multiple-choice

Papers

Showing 51-75 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1 |
| Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | | 0 |
| Set-LLM: A Permutation-Invariant LLM | | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | | 0 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4 |
| MR. Judge: Multimodal Reasoner as a Judge | | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0 |
| Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | Code | 0 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1 |
| ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | Code | 1 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2 |
| Ranked Voting based Self-Consistency of Large Language Models | Code | 1 |
| The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0 |
| SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Code | 0 |
| Benchmarking AI scientists in omics data-driven biological research | Code | 1 |
| HealthBench: Evaluating Large Language Models Towards Improved Human Health | Code | 7 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0 |

No leaderboard results yet.