| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 | 5 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 | 5 |
| FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Jun 16, 2024 | Diversity, Multiple-choice | Code Available | 1 | 5 |
| Constructing Narrative Event Evolutionary Graph for Script Event Prediction | May 14, 2018 | Graph Neural Network, Multiple-choice | Code Available | 1 | 5 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Feb 26, 2024 | Multiple-choice | Code Available | 1 | 5 |
| Latxa: An Open Language Model and Evaluation Suite for Basque | Mar 29, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Nov 1, 2020 | Distractor Generation, Multiple-choice | Code Available | 1 | 5 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Feb 24, 2025 | Logical Reasoning, Multiple-choice | Code Available | 1 | 5 |
| CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Jun 28, 2022 | General Knowledge, Language Modelling | Code Available | 1 | 5 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 | 5 |
| Leaf: Multiple-Choice Question Generation | Jan 22, 2022 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| General-Purpose Question-Answering with Macaw | Sep 6, 2021 | Generative Question Answering, Multiple-choice | Code Available | 1 | 5 |
| SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | Jul 20, 2023 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| Leveraging Large Language Models for Multiple Choice Question Answering | Oct 22, 2022 | Answer Selection, Multiple-choice | Code Available | 1 | 5 |
| GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities | Jan 11, 2023 | Multiple-choice | Code Available | 1 | 5 |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Nov 2, 2018 | Common Sense Reasoning, Multiple-choice | Code Available | 1 | 5 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 | 5 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | Code Available | 1 | 5 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 | 5 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code Generation, Multiple-choice | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | May 3, 2020 | Abstractive Text Summarization, Cloze Test | Code Available | 1 | 5 |
| Language Model Uncertainty Quantification with Attention Chain | Mar 24, 2025 | Computational Efficiency, Language Modeling | Code Available | 1 | 5 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 | 5 |