SOTAVerified

Multiple-choice

Papers

Showing 281-290 of 1107 papers

Title | Status | Hype
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | Code | 0
VUDG: A Dataset for Video Understanding Domain Generalization | - | 0
PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | - | 0
ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | - | 0
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | Code | 0
Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | - | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | - | 0
Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | - | 0
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | Code | 0

No leaderboard results yet.