| Title | Date | Tags | Code Status | Count |
| --- | --- | --- | --- | --- |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Mar 2, 2024 | Multiple-choice | Code Available | 1 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | Unverified | 0 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Feb 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Feb 27, 2024 | Document Classification, Language Modeling | Code Available | 1 |
| Unsupervised multiple choices question answering via universal corpus | Feb 27, 2024 | Form, Knowledge Graphs | Unverified | 0 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Feb 26, 2024 | Language Modeling | Code Available | 1 |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Feb 24, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction, Language Modeling | Code Available | 1 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | Feb 22, 2024 | Multiple-choice | Unverified | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Ranking Large Language Models without Ground Truth | Feb 21, 2024 | Multiple-choice, Triplet | Unverified | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | Feb 21, 2024 | Multiple-choice | Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | Unverified | 0 |
| Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities | Feb 20, 2024 | Multiple-choice, Text Simplification | Unverified | 0 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Feb 20, 2024 | Mixture-of-Experts, Multiple-choice | Code Available | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models | Feb 19, 2024 | Multiple-choice | Unverified | 0 |
| Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Feb 19, 2024 | Decision Making, Memorization | Code Available | 0 |
| Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Feb 19, 2024 | Multiple-choice, Uncertainty Quantification | Code Available | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | KMMLU, Language Model Evaluation | Unverified | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling | Code Available | 0 |
| DE-COP: Detecting Copyrighted Content in Language Models Training Data | Feb 15, 2024 | Language Modeling | Code Available | 0 |
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Feb 12, 2024 | General Knowledge, Multiple-choice | Code Available | 2 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Feb 7, 2024 | Multiple-choice, Prompt Engineering | Code Available | 1 |
| Prompting Implicit Discourse Relation Annotation | Feb 7, 2024 | Classification, Implicit Discourse Relation Classification | Unverified | 0 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Feb 7, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | Feb 6, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | Feb 6, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Feb 6, 2024 | Attribute, Face Anti-Spoofing | Code Available | 1 |
| Enhancing textual textbook question answering with large language models and retrieval augmented generation | Feb 5, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | Feb 2, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Distractor Generation in Multiple-Choice Tasks: A Survey of Methods, Datasets, and Evaluation | Feb 2, 2024 | Distractor Generation, Multiple-choice | Unverified | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer Selection, Language Modeling | Code Available | 0 |
| An Information-Theoretic Approach to Analyze NLP Classification Tasks | Feb 1, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Jan 27, 2024 | Medical Question Answering, Multiple-choice | Code Available | 2 |
| Towards Collective Superintelligence: Amplifying Group IQ using Conversational Swarms | Jan 25, 2024 | Decision Making, Multiple-choice | Unverified | 0 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Jan 25, 2024 | Information Retrieval, Multiple-choice | Code Available | 1 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 |
| What Large Language Models Know and What People Think They Know | Jan 24, 2024 | Articles, Decision Making | Unverified | 0 |