SOTAVerified

Multiple-choice

Papers

Showing 501–550 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1 |
| FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | | 0 |
| PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games | Code | 2 |
| From Multiple-Choice to Extractive QA: A Case Study for English and Arabic | Code | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | 0 |
| SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Code | 3 |
| AI and Machine Learning for Next Generation Science Assessments | | 0 |
| TAXI: Evaluating Categorical Knowledge Editing for Language Models | Code | 0 |
| UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions | Code | 0 |
| Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank | | 0 |
| Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | | 0 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | | 0 |
| ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | | 0 |
| Question Difficulty Ranking for Multiple-Choice Reading Comprehension | | 0 |
| Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Code | 0 |
| Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models | Code | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 0 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Code | 3 |
| MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Code | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | | 0 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4 |
| NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA | Code | 0 |
| CSEPrompts: A Benchmark of Introductory Computer Science Prompts | Code | 0 |
| Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models | Code | 0 |
| AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles | Code | 0 |
| Latxa: An Open Language Model and Evaluation Suite for Basque | Code | 1 |
| Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Code | 1 |
| BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text | Code | 4 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Code | 2 |
| Can multiple-choice questions really be useful in detecting the abilities of LLMs? | Code | 0 |
| PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models | Code | 3 |
| Understanding Long Videos with Multimodal Language Models | Code | 2 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1 |
| Pragmatic Competence Evaluation of Large Language Models for the Korean Language | Code | 0 |
| LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | | 0 |
| Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering | | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | | 0 |
| EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models | Code | 0 |
| Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings | Code | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | | 0 |
| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | | 0 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1 |
| MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding | | 0 |
| Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Code | 1 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Code | 4 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | | 0 |
| Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5 | | 0 |
| To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Code | 1 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | | 0 |
Page 11 of 23

No leaderboard results yet.