SOTAVerified

Multiple-choice

Papers

Showing 651-700 of 1107 papers

Title | Status | Hype
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | | 0
Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies | | 0
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | | 0
HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension | | 0
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | | 0
HindiLLM: Large Language Model for Hindi | | 0
Analyzing Multiple-Choice Reading and Listening Comprehension Tests | | 0
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | 0
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | | 0
How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels | | 0
How Susceptible are LLMs to Influence in Prompts? | | 0
How well do LLMs reason over tabular data, really? | | 0
HRCA+: Advanced Multiple-choice Machine Reading Comprehension Method | | 0
Humanity's Last Exam | | 0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators | | 0
Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings | | 0
Identification of mental fatigue in language comprehension tasks based on EEG and deep learning | | 0
Treatment Effects with Multidimensional Unobserved Heterogeneity: Identification of the Marginal Treatment Effect | | 0
Identifying Multiple Personalities in Large Language Models with External Evaluation | | 0
Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words | | 0
IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation | | 0
IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE | | 0
IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models | | 0
Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | | 0
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | | 0
Improved Few-Shot Image Classification Through Multiple-Choice Questions | | 0
Improvement/Extension of Modular Systems as Combinatorial Reengineering (Survey) | | 0
Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank | | 0
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | | 0
Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | | 0
Improving the Production Efficiency and Well-formedness of Automatically-Generated Multiple-Choice Cloze Vocabulary Questions | | 0
In Case You Missed It: ARC 'Challenge' Is Not That Challenging | | 0
TVBench: Redesigning Video-Language Evaluation | | 0
Indirect Identification of Psychosocial Risks from Natural Language | | 0
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | | 0
Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions | | 0
InnerThoughts: Disentangling Representations and Predictions in Large Language Models | | 0
InstructionBench: An Instructional Video Understanding Benchmark | | 0
Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | | 0
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | | 0
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | | 0
Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering | | 0
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0
Investigating Data Contamination in Modern Benchmarks for Large Language Models | | 0
Self-Assessment Tests are Unreliable Measures of LLM Personality | | 0
Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination | | 0
Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | | 0
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | | 0
ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | | 0
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | | 0
Page 14 of 23

No leaderboard results yet.