SOTAVerified

Multiple-choice

Papers

Showing 426-450 of 1107 papers

Title | Status | Hype
DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Code | 0
Are Large Language Models Consistent over Value-laden Questions? | Code | 0
Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Code | 0
It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Code | 0
Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Code | 0
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Code | 0
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Code | 0
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Code | 0
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Code | 0
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | Code | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | Code | 0
StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding | Code | 0
Introducing a framework to assess newly created questions with Natural Language Processing | Code | 0
DE-COP: Detecting Copyrighted Content in Language Models Training Data | Code | 0
An Automatic Question Usability Evaluation Toolkit | Code | 0
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Code | 0
Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Code | 0
A Profit-Maximizing Strategy for Advertising on the e-Commerce Platforms | Code | 0
Fusing Models with Complementary Expertise | Code | 0
TAXI: Evaluating Categorical Knowledge Editing for Language Models | Code | 0
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | Code | 0
Chance-Constrained Multiple-Choice Knapsack Problem: Model, Algorithms, and Applications | Code | 0
Improving Question Answering with External Knowledge | Code | 0
DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Code | 0

No leaderboard results yet.