SOTAVerified

Multiple-choice

Papers

Showing 126-150 of 1107 papers

Title | Status | Hype
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
FarsTail: A Persian Natural Language Inference Dataset | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
Annealed Winner-Takes-All for Motion Forecasting | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Code | 1
Evaluating the Knowledge Dependency of Questions | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
An MRC Framework for Semantic Role Labeling | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1

No leaderboard results yet.