SOTAVerified

Multiple-choice

Papers

Showing 601–650 of 1107 papers

Title | Status | Hype
QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | | 0
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | | 0
On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | | 0
Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | | 0
QOG:Question and Options Generation based on Language Model | | 0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Code | 0
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Code | 0
Grade Score: Quantifying LLM Performance in Option Selection | Code | 0
Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | | 0
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
Bayesian Statistical Modeling with Predictors from LLMs | | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | | 0
OLMES: A Standard for Language Model Evaluations | | 0
BertaQA: How Much Do Language Models Know About Local Culture? | Code | 0
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | | 0
Towards a Personal Health Large Language Model | | 0
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0
Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | | 0
CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Code | 0
LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Code | 0
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Code | 0
M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Code | 0
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | | 0
Automating Turkish Educational Quiz Generation Using Large Language Models | Code | 0
Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Code | 0
Order-Independence Without Fine Tuning | Code | 0
Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Code | 0
Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | | 0
Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | Code | 0
An Automatic Question Usability Evaluation Toolkit | Code | 0
Evaluating Large Language Model Biases in Persona-Steered Generation | Code | 0
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | Code | 0
DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension | | 0
Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints | | 0
Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | | 0
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | Code | 0
Eliciting Informative Text Evaluations with Large Language Models | Code | 0
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | | 0
Robust portfolio optimization model for electronic coupon allocation | | 0
Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications | | 0
COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | | 0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0
AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | | 0
CinePile: A Long Video Question Answering Dataset and Benchmark | | 0
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation | | 0
Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | Code | 0
Page 13 of 23

No leaderboard results yet.