SOTAVerified

Multiple-choice

Papers

Showing 281–290 of 1107 papers

Title | Status | Hype
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | - | 0
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | - | 0
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | - | 0
Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | - | 0
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Code | 1
A Benchmark for Long-Form Medical Question Answering | Code | 0
DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Code | 0
TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Code | 0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | - | 0
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Code | 0

No leaderboard results yet.