| An Automatic Question Usability Evaluation Toolkit | May 30, 2024 | Multiple-choice, Word Embeddings | Code Available | 0 | 5 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 | 5 |
| A Profit-Maximizing Strategy for Advertising on the e-Commerce Platforms | Oct 31, 2022 | Management, Multiple-choice | Code Available | 0 | 5 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | May 30, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 | 5 |
| Chance-Constrained Multiple-Choice Knapsack Problem: Model, Algorithms, and Applications | Jun 26, 2023 | Combinatorial Optimization, Multiple-choice | Code Available | 0 | 5 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | May 25, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 | 5 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Nov 13, 2024 | Decision Making, Missing Values | Code Available | 0 | 5 |
| Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Oct 2, 2024 | Multiple-choice | Code Available | 0 | 5 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Nov 14, 2024 | Form, Hallucination | Code Available | 0 | 5 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Feb 8, 2025 | Legal Reasoning, Multiple-choice | Code Available | 0 | 5 |
| Improving Question Answering with External Knowledge | Feb 3, 2019 | ARC, Multiple-choice | Code Available | 0 | 5 |
| CSEPrompts: A Benchmark of Introductory Computer Science Prompts | Apr 3, 2024 | Multiple-choice | Code Available | 0 | 5 |
| INCEPTNET: Precise And Early Disease Detection Application For Medical Images Analyses | Sep 5, 2023 | Cell Detection, Lesion Segmentation | Code Available | 0 | 5 |
| Improving Machine Reading Comprehension with General Reading Strategies | Oct 31, 2018 | ARC, Language Modeling | Code Available | 0 | 5 |
| AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval | Oct 3, 2023 | Articles, Decision Making | Code Available | 0 | 5 |
| QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling | Sep 21, 2024 | Multiple-choice, Prompt Engineering | Code Available | 0 | 5 |
| A multimodal dataset for understanding the impact of mobile phones on remote online virtual education | Dec 13, 2024 | EEG, Head Pose Estimation | Code Available | 0 | 5 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 | 5 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactual, Decision Making | Code Available | 0 | 5 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| A Benchmark for Long-Form Medical Question Answering | Nov 14, 2024 | Answer Generation, Form | Code Available | 0 | 5 |
| Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy | May 24, 2023 | In-Context Learning, Multiple-choice | Code Available | 0 | 5 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | May 2, 2025 | High School Physics, Misconceptions | Code Available | 0 | 5 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | May 13, 2025 | Benchmarking, Diagnostic | Code Available | 0 | 5 |