| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Jun 6, 2024 | abstractive question answering, Clinical Knowledge | Code Available | 0 |
| Order-Independence Without Fine Tuning | Jun 4, 2024 | Language Modelling, Multiple-choice | Code Available | 0 |
| Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings | Mar 14, 2024 | Multiple-choice, Time Series | Code Available | 0 |
| PROST: Physical Reasoning of Objects through Space and Time | Jun 7, 2021 | Multiple-choice | Code Available | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Apr 3, 2025 | Multiple-choice | Code Available | 0 |
| Evaluating Prompts Across Multiple Choice Tasks In a Zero-Shot Setting | Mar 29, 2022 | Multiple-choice | Code Available | 0 |
| This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models | May 24, 2023 | Language Modelling, Large Language Model | Code Available | 0 |
| Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks | Oct 16, 2024 | Instruction Following, Multiple-choice | Code Available | 0 |
| Multi-class Hierarchical Question Classification for Multiple Choice Science Exams | Aug 15, 2019 | Classification, General Classification | Code Available | 0 |