SOTAVerified

Multiple-choice

Papers

Showing 676-700 of 1107 papers

Title | Status | Hype
Language models are susceptible to incorrect patient self-diagnosis in medical applications |  | 0
Self-Assessment Tests are Unreliable Measures of LLM Personality |  | 0
SafetyBench: Evaluating the Safety of Large Language Models | Code | 2
Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions |  | 0
Use neural networks to recognize students' handwritten letters and incorrect symbols |  | 0
Large Language Models Are Not Robust Multiple Choice Selectors | Code | 1
An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models |  | 0
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1
INCEPTNET: Precise And Early Disease Detection Application For Medical Images Analyses | Code | 0
Generalised Winograd Schema and its Contextuality |  | 0
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2
Spoken Language Intelligence of Large Language Models for Language Learning | Code | 0
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions |  | 0
LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Code | 1
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology |  | 0
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning | Code | 0
ChatGPT for GTFS: Benchmarking LLMs on GTFS Understanding and Retrieval | Code | 0
ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding | Code | 0
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2
Distractor generation for multiple-choice questions with predictive prompting and large language models | Code | 0
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2
A large language model-assisted education tool to provide feedback on open-ended responses | Code | 0

No leaderboard results yet.