| Title | Date | Tags | Code Status | Count |
| --- | --- | --- | --- | --- |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Mar 2, 2024 | Multiple-choice | Code Available | 1 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | Unverified | 0 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Feb 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Feb 27, 2024 | Document Classification, Language Modeling | Code Available | 1 |
| Unsupervised multiple choices question answering via universal corpus | Feb 27, 2024 | Form, Knowledge Graphs | Unverified | 0 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Feb 26, 2024 | Language Modeling | Code Available | 1 |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Feb 24, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction, Language Modeling | Code Available | 1 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | Feb 22, 2024 | Multiple-choice | Unverified | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Ranking Large Language Models without Ground Truth | Feb 21, 2024 | Multiple-choice, Triplet | Unverified | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | Feb 21, 2024 | Multiple-choice | Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | Unverified | 0 |
| Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities | Feb 20, 2024 | Multiple-choice, Text Simplification | Unverified | 0 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Feb 20, 2024 | Mixture-of-Experts, Multiple-choice | Code Available | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models | Feb 19, 2024 | Multiple-choice | Unverified | 0 |
| Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Feb 19, 2024 | Decision Making, Memorization | Code Available | 0 |
| Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Feb 19, 2024 | Multiple-choice, Uncertainty Quantification | Code Available | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | KMMLU, Language Model Evaluation | Unverified | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling | Code Available | 0 |
| DE-COP: Detecting Copyrighted Content in Language Models Training Data | Feb 15, 2024 | Language Modeling | Code Available | 0 |
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Feb 12, 2024 | General Knowledge, Multiple-choice | Code Available | 2 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Feb 7, 2024 | Multiple-choice, Prompt Engineering | Code Available | 1 |
| Prompting Implicit Discourse Relation Annotation | Feb 7, 2024 | Classification, Implicit Discourse Relation Classification | Unverified | 0 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Feb 7, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | Feb 6, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | Feb 6, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Feb 6, 2024 | Attribute, Face Anti-Spoofing | Code Available | 1 |
| Enhancing textual textbook question answering with large language models and retrieval augmented generation | Feb 5, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | Feb 2, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation | Feb 2, 2024 | Distractor Generation, Multiple-choice | Unverified | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer Selection, Language Modeling | Code Available | 0 |
| An Information-Theoretic Approach to Analyze NLP Classification Tasks | Feb 1, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Jan 27, 2024 | Medical Question Answering, Multiple-choice | Code Available | 2 |
| Towards Collective Superintelligence: Amplifying Group IQ using Conversational Swarms | Jan 25, 2024 | Decision Making, Multiple-choice | Unverified | 0 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Jan 25, 2024 | Information Retrieval, Multiple-choice | Code Available | 1 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 |
| What Large Language Models Know and What People Think They Know | Jan 24, 2024 | Articles, Decision Making | Unverified | 0 |