SOTAVerified

Multiple-choice

Papers

Showing 401–450 of 1107 papers

Title | Status | Hype
Adversarial Databases Improve Success in Retrieval-based Large Language Models | - | 0
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models | - | 0
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Code | 1
AstroMLab 1: Who Wins Astronomy Jeopardy!? | - | 0
NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | - | 0
LAB-Bench: Measuring Capabilities of Language Models for Biology Research | - | 0
Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Code | 0
Evaluating Nuanced Bias in Large Language Model Free Response Answers | - | 0
Self-Recognition in Language Models | Code | 0
ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Code | 1
Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Code | 0
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding | Code | 2
Are Large Language Models Consistent over Value-laden Questions? | Code | 0
CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | - | 0
Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Code | 0
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
Changing Answer Order Can Decrease MMLU Accuracy | - | 0
Length Optimization in Conformal Prediction | Code | 0
DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Code | 0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | - | 0
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | - | 0
QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | - | 0
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Code | 2
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | - | 0
On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | - | 0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
QOG: Question and Options Generation based on Language Model | - | 0
DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Code | 0
IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Code | 0
Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | - | 0
Grade Score: Quantifying LLM Performance in Option Selection | Code | 0
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | - | 0
VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It | - | 0
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Code | 2
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
Bayesian Statistical Modeling with Predictors from LLMs | - | 0
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models | - | 0
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
Page 9 of 23

No leaderboard results yet.