SOTAVerified

Multiple-choice

Papers

Showing 251–275 of 1107 papers

Title | Status | Hype
Unsupervised Commonsense Question Answering with Self-Talk | Code | 1
R2DE: a NLP approach to estimating IRT parameters of newly generated questions | Code | 1
WIQA: A dataset for "What if..." reasoning over procedural text | Code | 1
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Code | 1
Generating Distractors for Reading Comprehension Questions from Real Examinations | Code | 1
Constructing Narrative Event Evolutionary Graph for Script Event Prediction | Code | 1
VQA: Visual Question Answering | Code | 1
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | - | 0
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | - | 0
MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks | - | 0
Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | - | 0
OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs | - | 0
Adapting Vision-Language Models for Evaluating World Models | - | 0
PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | - | 0
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | - | 0
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | - | 0
Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings | - | 0
Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding | - | 0
Training-free LLM Merging for Multi-task Learning | Code | 0
Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | - | 0
Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | - | 0
A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | - | 0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | - | 0
ARGUS: Hallucination and Omission Evaluation in Video-LLMs | - | 0
Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | - | 0

No leaderboard results yet.