Multiple-choice

Papers


Showing 451–500 of 1107 papers

Title	Date	Tasks	Status	Hype
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding	Jun 13, 2024	Multiple-choice, Scene Understanding	Code Available	1
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation	Jun 13, 2024	Benchmarking, Hallucination	Code Available	0
OLMES: A Standard for Language Model Evaluations	Jun 12, 2024	Language Modeling, Language Modelling	Unverified	0
Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena	Jun 11, 2024	Multiple-choice, Selection bias	Code Available	2
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	Jun 11, 2024	Multiple-choice, Question Answering	Code Available	5
BertaQA: How Much Do Language Models Know About Local Culture?	Jun 11, 2024	Multiple-choice, Transfer Learning	Code Available	0
Towards a Personal Health Large Language Model	Jun 10, 2024	Language Modeling, Language Modelling	Unverified	0
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context	Jun 10, 2024	Decision Making, Multiple-choice	Unverified	0
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation	Jun 8, 2024	Abstractive Text Summarization, Dialogue Generation	Unverified	0
Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts	Jun 8, 2024	Machine Translation, Multiple-choice	Unverified	0
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding	Jun 8, 2024	Descriptive, Language Modelling	Code Available	1
LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs	Jun 7, 2024	Mathematical Reasoning, Multiple-choice	Code Available	0
CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models	Jun 7, 2024	Multiple-choice, Philosophy	Code Available	0
M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering	Jun 6, 2024	abstractive question answering, Clinical Knowledge	Code Available	0
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?	Jun 6, 2024	Multiple-choice, Question Answering	Unverified	0
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures	Jun 6, 2024	Common Sense Reasoning, Language Modeling	Code Available	0
Automating Turkish Educational Quiz Generation Using Large Language Models	Jun 5, 2024	Multiple-choice	Code Available	0
Order-Independence Without Fine Tuning	Jun 4, 2024	Language Modelling, Multiple-choice	Code Available	0
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners	Jun 4, 2024	Multiple-choice, Spatial Reasoning	Code Available	1
Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data	Jun 4, 2024	Clinical Knowledge, Multiple-choice	Code Available	0
Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph	Jun 3, 2024	Knowledge Graphs, Multiple-choice	Unverified	0
Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors	Jun 3, 2024	Multiple-choice, Selection bias	Code Available	0
Evaluating Large Language Model Biases in Persona-Steered Generation	May 30, 2024	Language Modeling, Language Modelling	Code Available	0
Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning	May 30, 2024	Misconceptions, Multiple-choice	Code Available	0
An Automatic Question Usability Evaluation Toolkit	May 30, 2024	Multiple-choice, Word Embeddings	Code Available	0
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions	May 30, 2024	Language Modelling, Large Language Model	Code Available	0
DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension	May 29, 2024	Distractor Generation, Multiple-choice	Unverified	0
Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints	May 28, 2024	Multiple-choice, Sentence	Unverified	0
Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer	May 27, 2024	Multiple-choice, Sentiment Analysis	Unverified	0
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers	May 25, 2024	Common Sense Reasoning, Multiple-choice	Code Available	0
Eliciting Informative Text Evaluations with Large Language Models	May 23, 2024	Multiple-choice, Prediction	Code Available	0
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation	May 23, 2024	Conversational Recommendation, Multiple-choice	Unverified	0
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation	May 22, 2024	Informativeness, Language Modeling	Code Available	2
Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning	May 22, 2024	Mathematical Reasoning, Multiple-choice	Code Available	1
Robust portfolio optimization model for electronic coupon allocation	May 21, 2024	Multiple-choice, Portfolio Optimization	Unverified	0
Multiple-Choice Questions are Efficient and Robust LLM Evaluators	May 20, 2024	GSM8K, HumanEval	Code Available	1
Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications	May 19, 2024	Multiple-choice	Unverified	0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT	May 17, 2024	Benchmarking, Multiple-choice	Unverified	0
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset	May 17, 2024	16k, Benchmarking	Code Available	3
COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain	May 17, 2024	Language Modeling, Language Modelling	Unverified	0
AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning	May 16, 2024	Multiple-choice, Question Answering	Unverified	0
CinePile: A Long Video Question Answering Dataset and Benchmark	May 14, 2024	Form, Human-Object Interaction Detection	Unverified	0
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation	May 14, 2024	Benchmarking, Multiple-choice	Code Available	1
MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation	May 13, 2024	In-Context Learning, Multiple-choice	Unverified	0
Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis	May 12, 2024	Multiple-choice, Question Answering	Code Available	0
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models	May 8, 2024	Attribute, Data Augmentation	Code Available	1
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning	May 6, 2024	Multiple-choice, Video Understanding	Unverified	0
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions	May 6, 2024	Decision Making, Multiple-choice	Code Available	0
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance	May 5, 2024	Multiple-choice	Code Available	2
Math Multiple Choice Question Generation via Human-Large Language Model Collaboration	May 1, 2024	Language Modeling, Language Modelling	Unverified	0

