SOTAVerified

Multiple-choice

Papers

Showing 151–200 of 1107 papers

Title | Status | Hype
An In-depth Look at Gemini's Language Abilities | Code | 1
General-Purpose Question-Answering with Macaw | Code | 1
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
An MRC Framework for Semantic Role Labeling | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Generating Distractors for Reading Comprehension Questions from Real Examinations | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
FarsTail: A Persian Natural Language Inference Dataset | Code | 1
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
Explicit Planning Helps Language Models in Logical Reasoning | Code | 1
JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
Leaf: Multiple-Choice Question Generation | Code | 1
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Code | 1
Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Code | 1
Fake Alignment: Are LLMs Really Aligned Well? | Code | 1
LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Code | 1
Long Horizon Temperature Scaling | Code | 1
A Few More Examples May Be Worth Billions of Parameters | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
Bridging Video-text Retrieval with Multiple Choice Questions | Code | 1
Evaluating language models as risk scores | Code | 1
Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Code | 1
EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Code | 1
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Page 4 of 23
