| M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | May 17, 2023 | Instruction Following, Multiple-choice | Code Available | 1 |
| A quantitative study of NLP approaches to question difficulty estimation | May 17, 2023 | Math, Multiple-choice | Code Available | 0 |
| C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | May 15, 2023 | Multiple-choice | Code Available | 3 |
| EMBRACE: Evaluation and Modifications for Boosting RACE | May 15, 2023 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | Code Available | 1 |
| MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | May 5, 2023 | Epistemic Reasoning, Language Modeling | Code Available | 1 |
| Contextual Response Interpretation for Automated Structured Interviews: A Case Study in Market Research | Apr 30, 2023 | Marketing, Multiple-choice | Unverified | 0 |
| Who's the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth Grade Math Answers | Apr 21, 2023 | Math, Multiple-choice | Unverified | 0 |
| Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies | Apr 15, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning | Apr 14, 2023 | Multiple-choice, Prompt Engineering | Unverified | 0 |