| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8K, Memorization | Code Available | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | Code Available | 1 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | Nov 9, 2023 | Multiple-choice, World Knowledge | Unverified | 0 |
| Assessing Distractors in Multiple-Choice Tests | Nov 8, 2023 | Diversity, Multiple-choice | Unverified | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | Nov 7, 2023 | Multiple-choice | Unverified | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | Nov 5, 2023 | Logical Reasoning, Multiple-choice | Unverified | 0 |
| More Robots are Coming: Large Multimodal Models (ChatGPT) can Solve Visually Diverse Images of Parsons Problems | Nov 3, 2023 | Multiple-choice | Unverified | 0 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Nov 3, 2023 | Multiple-choice | Code Available | 0 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Nov 2, 2023 | Density Estimation, Diversity | Code Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwag, Language Modeling | Code Available | 1 |