SOTAVerified

Multiple-choice

Papers

Showing 101–150 of 1107 papers

Title | Status | Hype
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | Code | 1
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | Code | 1
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Code | 1
Large Language Models Encode Clinical Knowledge | Code | 1
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Code | 1
JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
Leveraging Large Language Models for Multiple Choice Question Answering | Code | 1
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Code | 1
Generating Distractors for Reading Comprehension Questions from Real Examinations | Code | 1
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1
GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities | Code | 1
From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap | Code | 1
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1
Long Horizon Temperature Scaling | Code | 1
General-Purpose Question-Answering with Macaw | Code | 1
GPT Takes the Bar Exam | Code | 1
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
MILU: A Multi-task Indic Language Understanding Benchmark | Code | 1
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
FarsTail: A Persian Natural Language Inference Dataset | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
Annealed Winner-Takes-All for Motion Forecasting | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Code | 1
Evaluating the Knowledge Dependency of Questions | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
An MRC Framework for Semantic Role Labeling | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1

No leaderboard results yet.