SOTAVerified

Multiple-choice

Papers

Showing 651–700 of 1107 papers

Title | Status | Hype
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | Code | 0
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | - | 0
Math Multiple Choice Question Generation via Human-Large Language Model Collaboration | - | 0
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | - | 0
From Multiple-Choice to Extractive QA: A Case Study for English and Arabic | Code | 0
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | - | 0
TAXI: Evaluating Categorical Knowledge Editing for Language Models | Code | 0
AI and Machine Learning for Next Generation Science Assessments | - | 0
UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions | Code | 0
Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank | - | 0
Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | - | 0
BLINK: Multimodal Large Language Models Can See but Not Perceive | - | 0
ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | - | 0
Question Difficulty Ranking for Multiple-Choice Reading Comprehension | - | 0
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Code | 0
Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models | Code | 0
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | - | 0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0
Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | - | 0
NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA | Code | 0
CSEPrompts: A Benchmark of Introductory Computer Science Prompts | Code | 0
Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models | Code | 0
AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles | Code | 0
Can multiple-choice questions really be useful in detecting the abilities of LLMs? | Code | 0
Pragmatic Competence Evaluation of Large Language Models for the Korean Language | Code | 0
LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | - | 0
Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering | - | 0
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | - | 0
EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models | Code | 0
Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings | Code | 0
Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | - | 0
AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | - | 0
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | - | 0
MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding | - | 0
Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5 | - | 0
An Improved Traditional Chinese Evaluation Suite for Foundation Model | - | 0
KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | - | 0
Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | - | 0
Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | - | 0
Unsupervised multiple choices question answering via universal corpus | - | 0
Biomedical Entity Linking as Multiple Choice Question Answering | Code | 0
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Code | 0
Identifying Multiple Personalities in Large Language Models with External Evaluation | - | 0
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | - | 0
Ranking Large Language Models without Ground Truth | - | 0
KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | - | 0
Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Code | 0
Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities | - | 0
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Code | 0
Stick to your Role! Stability of Personal Values Expressed in Large Language Models | - | 0
Page 14 of 23

Leaderboards

No leaderboard results yet.