SOTAVerified

Multiple-choice

Papers

Showing 301–325 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Enhancing LLM Evaluations: The Garbling Trick | | 0 |
| Benchmarking Bias in Large Language Models during Role-Playing | | 0 |
| R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest | | 0 |
| Improving Model Evaluation using SMART Filtering of Benchmark Datasets | Code | 3 |
| GPT-4o System Card | | 0 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Code | 1 |
| Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare | | 0 |
| Large Language Models Still Exhibit Bias in Long Text | | 0 |
| GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks | | 0 |
| Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S. | | 0 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Code | 0 |
| TimeSeriesExam: A time series understanding exam | Code | 1 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | | 0 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Code | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | | 0 |
| Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks | Code | 0 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Code | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Code | 0 |
| Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing | | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Code | 1 |
| LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models | | 0 |

No leaderboard results yet.