SOTAVerified

Multiple-choice

Papers

Showing 451-500 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Code | 1 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0 |
| OLMES: A Standard for Language Model Evaluations | | 0 |
| Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena | Code | 2 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Code | 0 |
| Towards a Personal Health Large Language Model | | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0 |
| Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts | | 0 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Code | 1 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Code | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Code | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Code | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Code | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Code | 0 |
| Order-Independence Without Fine Tuning | Code | 0 |
| TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Code | 1 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Code | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | | 0 |
| Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Code | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | Code | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | Code | 0 |
| An Automatic Question Usability Evaluation Toolkit | Code | 0 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | Code | 0 |
| DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension | | 0 |
| Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints | | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | | 0 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | Code | 0 |
| Eliciting Informative Text Evaluations with Large Language Models | Code | 0 |
| Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | | 0 |
| Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation | Code | 2 |
| Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | Code | 1 |
| Robust portfolio optimization model for electronic coupon allocation | | 0 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1 |
| Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications | | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | | 0 |
| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | | 0 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1 |
| MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation | | 0 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | Code | 0 |
| THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | Code | 1 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | | 0 |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | Code | 0 |
| Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | Code | 2 |
| Math Multiple Choice Question Generation via Human-Large Language Model Collaboration | | 0 |
Page 10 of 23

No leaderboard results yet.