SOTAVerified

Multiple-choice

Papers

Showing 1–50 of 1107 papers (page 1 of 23)

| Title | Status | Hype |
| --- | --- | --- |
| The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | | 0 |
| HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | | 0 |
| MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks | | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | | 0 |
| OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs | | 0 |
| Adapting Vision-Language Models for Evaluating World Models | | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | | 0 |
| Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings | | 0 |
| Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding | | 0 |
| Training-free LLM Merging for Multi-task Learning | Code | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | | 0 |
| Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | | 0 |
| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | | 0 |
| Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights | | 0 |
| LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1 |
| Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales | | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | | 0 |
| Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean | Code | 1 |
| PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | | 0 |
| ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | | 0 |
| Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | | 0 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | Code | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | Code | 0 |
| TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine | | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0 |
| VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Code | 2 |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | Code | 0 |
| Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | | 0 |
| Large Language Models Often Know When They Are Being Evaluated | | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | | 0 |
| My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals | | 0 |
| Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions | | 0 |
| CP-Router: An Uncertainty-Aware Router Between LLM and LRM | | 0 |
| DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response | | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | Code | 0 |
| Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning | | 0 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1 |
| KoBALT: Korean Benchmark For Advanced Linguistic Tasks | | 0 |
| Collaboration among Multiple Large Language Models for Medical Question Answering | | 0 |
| AutoMCQ -- Automatically Generate Code Comprehension Questions using GenAI | | 0 |