SOTAVerified

Multiple-choice

Papers

Showing 226–250 of 1107 papers

Title | Status | Hype
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Taming Overconfidence in LLMs: Reward Calibration in RLHF | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
Evaluating the Knowledge Dependency of Questions | Code | 1
HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Code | 1
TIMEDIAL: Temporal Commonsense Reasoning in Dialog | Code | 1
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Code | 1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Code | 1
TSQA: Tabular Scenario Based Question Answering | Code | 1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
Counterfactual Variable Control for Robust and Interpretable Question Answering | Code | 1
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Code | 1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
Unsupervised Commonsense Question Answering with Self-Talk | Code | 1
Conformal Prediction with Large Language Models for Multi-Choice Question Answering | Code | 1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Page 10 of 45

No leaderboard results yet.