| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | Benchmarking, Multiple-choice | Code Available | 1 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 |
| NarrativeXL: A Large-scale Dataset For Long-Term Memory Models | May 23, 2023 | Multiple-choice, Reading Comprehension | Code Available | 1 |
| VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | May 20, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | May 17, 2023 | Instruction Following, Multiple-choice | Code Available | 1 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | Code Available | 1 |
| MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic | May 5, 2023 | Epistemic Reasoning, Language Modeling | Code Available | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | Code Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| Long Horizon Temperature Scaling | Feb 7, 2023 | Multiple-choice | Code Available | 1 |