SOTAVerified

Multiple-choice

Papers

Showing 801–850 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| A Weak Supervision Approach for Predicting Difficulty of Technical Interview Questions | | 0 |
| Bayesian Statistical Modeling with Predictors from LLMs | | 0 |
| Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets | | 0 |
| Benchmarking Bias in Large Language Models during Role-Playing | | 0 |
| The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models | | 0 |
| Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions | | 0 |
| The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | | 0 |
| Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change | | 0 |
| Better Distractions: Transformer-based Distractor Generation and Multiple Choice Question Filtering | | 0 |
| Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare | | 0 |
| Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | | 0 |
| Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | | 0 |
| Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing | | 0 |
| The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | | 0 |
| ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | | 0 |
| A Novel Approach for Constrained Optimization in Graphical Models | | 0 |
| BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles | | 0 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | | 0 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | | 0 |
| An MRC Framework for Semantic Role Labeling | | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | | 0 |
| The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | | 0 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | | 0 |
| The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts | | 0 |
| Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | | 0 |
| Bridging the Language Gap: Knowledge Injected Multilingual Question Answering | | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | | 0 |
| Can Crowdsourcing be used for Effective Annotation of Arabic? | | 0 |
| Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? | | 0 |
| The use of large language models to enhance cancer clinical trial educational materials | | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | | 0 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | | 0 |
| What Makes Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types | | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | | 0 |
Page 17 of 23

No leaderboard results yet.