SOTAVerified

Multiple-choice

Papers

Showing 101-150 of 1107 papers

Title | Status | Hype
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | Code | 1
AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
Leaf: Multiple-Choice Question Generation | Code | 1
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities | Code | 1
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
GPT Takes the Bar Exam | Code | 1
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
General-Purpose Question-Answering with Macaw | Code | 1
From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap | Code | 1
Generating Distractors for Reading Comprehension Questions from Real Examinations | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | Code | 1
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
FarsTail: A Persian Natural Language Inference Dataset | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
Annealed Winner-Takes-All for Motion Forecasting | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Code | 1
Evaluating the Knowledge Dependency of Questions | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
An MRC Framework for Semantic Role Labeling | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1

No leaderboard results yet.