SOTAVerified

Multiple-choice

Papers

Showing 201–250 of 1107 papers

Title | Status | Hype
Can large language models reason about medical questions? | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Code | 1
Long Horizon Temperature Scaling | Code | 1
A Few More Examples May Be Worth Billions of Parameters | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
LongHealth: A Question Answering Benchmark with Long Clinical Documents | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Code | 1
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Code | 1
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering | Code | 1
LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Code | 1
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1
LifeQA: A Real-life Dataset for Video Question Answering | Code | 1
Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Code | 1
Leveraging Large Language Models for Multiple Choice Question Answering | Code | 1
ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1
JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
Training Trajectories of Language Models Across Scales | Code | 1
Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning | Code | 1
TSQA: Tabular Scenario Based Question Answering | Code | 1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Code | 1
Counterfactual Variable Control for Robust and Interpretable Question Answering | Code | 1
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | Code | 1
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Code | 1
Large Language Models Encode Clinical Knowledge | Code | 1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Code | 1
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Code | 1
Constructing Narrative Event Evolutionary Graph for Script Event Prediction | Code | 1
Marathon: A Race Through the Realm of Long Context with Large Language Models | Code | 1

No leaderboard results yet.