| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Mar 2, 2024 | Multiple-choice | Code Available | 1 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | Unverified | 0 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Feb 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Feb 27, 2024 | Document Classification, Language Modeling | Code Available | 1 |
| Unsupervised multiple choices question answering via universal corpus | Feb 27, 2024 | Form, Knowledge Graphs | Unverified | 0 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Feb 26, 2024 | Language Modeling, Language Modelling | Code Available | 1 |