SOTAVerified

Multiple-choice

Papers

Showing 201-250 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1 |
| OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Code | 1 |
| Option Tracing: Beyond Correctness Analysis in Knowledge Tracing | Code | 1 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Code | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Code | 1 |
| A Few More Examples May Be Worth Billions of Parameters | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Code | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Code | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Code | 1 |
| R2DE: a NLP approach to estimating IRT parameters of newly generated questions | Code | 1 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Code | 1 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Code | 1 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1 |
| Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1 |
| Evaluating language models as risk scores | Code | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1 |
| SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Code | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Code | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1 |
| Evaluating the Knowledge Dependency of Questions | Code | 1 |
| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Code | 1 |
| Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Code | 1 |
| TIMEDIAL: Temporal Commonsense Reasoning in Dialog | Code | 1 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1 |
| CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1 |
| TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Code | 1 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Code | 1 |
| TSQA: Tabular Scenario Based Question Answering | Code | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1 |
| Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Code | 1 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1 |
| Assessing the Chemical Intelligence of Large Language Models | Code | 1 |
| Unsupervised Commonsense Question Answering with Self-Talk | Code | 1 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | Code | 1 |
| ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1 |
Page 5 of 23

No leaderboard results yet.