SOTAVerified

Multiple-choice

Papers

Showing 451–475 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0 |
| MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Code | 1 |
| OLMES: A Standard for Language Model Evaluations | | 0 |
| Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena | Code | 2 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Code | 0 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5 |
| Towards a Personal Health Large Language Model | | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | | 0 |
| Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Code | 1 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Code | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Code | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Code | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Code | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Code | 0 |
| Order-Independence Without Fine Tuning | Code | 0 |
| TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Code | 1 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Code | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | | 0 |
| Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Code | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | Code | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | Code | 0 |
| An Automatic Question Usability Evaluation Toolkit | Code | 0 |

No leaderboard results yet.