SOTAVerified

Multiple-choice

Papers

Showing 351–400 of 1107 papers

Title | Status | Hype
Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education |  | 0
Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension |  | 0
Bridging Information-Seeking Human Gaze and Machine Reading Comprehension |  | 0
Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions |  | 0
GPT-4o System Card |  | 0
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation |  | 0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators |  | 0
Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions |  | 0
AI-based Arabic Language and Speech Tutor |  | 0
Generating Correct Answers for Progressive Matrices Intelligence Tests |  | 0
Answering Science Exam Questions Using Query Reformulation with Background Knowledge |  | 0
Evaluating multiple large language models in pediatric ophthalmology |  | 0
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition |  | 0
Generating Diagnostic Multiple Choice Comprehension Cloze Questions |  | 0
Answering Science Exam Questions Using Query Rewriting with Background Knowledge |  | 0
Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset |  | 0
Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis |  | 0
BloomVQA: Assessing Hierarchical Multi-modal Comprehension |  | 0
AI and Machine Learning for Next Generation Science Assessments |  | 0
Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth |  | 0
Answering Questions in Stages: Prompt Chaining for Contract QA |  | 0
BLINK: Multimodal Large Language Models Can See but Not Perceive |  | 0
Evaluating Machine Reading Systems through Comprehension Tests |  | 0
ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising |  | 0
Generating Adequate Distractors for Multiple-Choice Questions |  | 0
Evaluating Nuanced Bias in Large Language Model Free Response Answers |  | 0
Answering questions by learning to rank - Learning to rank by answering questions |  | 0
Evaluating Question Answering Evaluation |  | 0
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs |  | 0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis |  | 0
Evalita-LLM: Benchmarking Large Language Models on Italian |  | 0
Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions |  | 0
BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles |  | 0
Generalised Winograd Schema and its Contextuality |  | 0
Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms |  | 0
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration |  | 0
Evaluation of Automatically Generated Pronoun Reference Questions |  | 0
Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension |  | 0
Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution |  | 0
Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil |  | 0
Establishing Task Scaling Laws via Compute-Efficient Model Ladders |  | 0
EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta |  | 0
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation |  | 0
ExplanationLP: Abductive Reasoning for Explainable Science Question Answering |  | 0
Can ChatGPT pass the Vietnamese National High School Graduation Examination? |  | 0
Answering questions by learning to rank -- Learning to rank by answering questions |  | 0
Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph |  | 0
Can Crowdsourcing be used for Effective Annotation of Arabic? |  | 0
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data |  | 0
Enhancing Multiple-Choice Question Answering with Causal Knowledge |  | 0
Page 8 of 23

No leaderboard results yet.