SOTAVerified

Multiple-choice

Papers

Showing 101-150 of 1107 papers

Title | Status | Hype
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1
Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Code | 1
Latxa: An Open Language Model and Evaluation Suite for Basque | Code | 1
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1
AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Code | 1
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Code | 1
Leveraging Large Language Models for Multiple Choice Question Answering | Code | 1
LifeQA: A Real-life Dataset for Video Question Answering | Code | 1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1
Large Language Models Encode Clinical Knowledge | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
Constructing Narrative Event Evolutionary Graph for Script Event Prediction | Code | 1
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Code | 1
Conformal Prediction with Large Language Models for Multi-Choice Question Answering | Code | 1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | Code | 1
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Code | 1
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1
A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1
Counterfactual Variable Control for Robust and Interpretable Question Answering | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
Annealed Winner-Takes-All for Motion Forecasting | Code | 1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
An MRC Framework for Semantic Role Labeling | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Generating Distractors for Reading Comprehension Questions from Real Examinations | Code | 1
GPT Takes the Bar Exam | Code | 1
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1
Can large language models reason about medical questions? | Code | 1

No leaderboard results yet.